2025-05-07T20:23:26.0508890Z Current runner version: '2.323.0'
2025-05-07T20:23:26.0515181Z Runner name: 'i-0bb11f79b54aad6c7'
2025-05-07T20:23:26.0516196Z Machine name: 'ip-10-0-16-208'
2025-05-07T20:23:26.0518887Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:23:26.0521158Z Contents: read
2025-05-07T20:23:26.0521668Z Metadata: read
2025-05-07T20:23:26.0522151Z Packages: read
2025-05-07T20:23:26.0522640Z ##[endgroup]
2025-05-07T20:23:26.0524496Z Secret source: None
2025-05-07T20:23:26.0525123Z Prepare workflow directory
2025-05-07T20:23:26.1453587Z Prepare all required actions
2025-05-07T20:23:26.1494009Z Getting action download info
2025-05-07T20:23:26.3905868Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:23:26.6496898Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:23:27.0214077Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:23:28.6102416Z Getting action download info
2025-05-07T20:23:28.7232991Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:23:28.9660430Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.12, 12.8.0, 12.6.3, gcc)
2025-05-07T20:23:29.0166054Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:23:29.0273012Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:23:29.0284522Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:29.0285178Z ##[endgroup]
2025-05-07T20:23:30.2907887Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:23:30.2908331Z Instance Type: g5.4xlarge
2025-05-07T20:23:30.2908580Z AMI Name: unknown
2025-05-07T20:23:30.2950070Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:23:35.6403906Z ##[group]Run actions/checkout@v4
2025-05-07T20:23:35.6404218Z with:
2025-05-07T20:23:35.6404444Z   submodules: true
2025-05-07T20:23:35.6404684Z   repository: pytorch/FBGEMM
2025-05-07T20:23:35.6405074Z   token: ***
2025-05-07T20:23:35.6405280Z   ssh-strict: true
2025-05-07T20:23:35.6405498Z   ssh-user: git
2025-05-07T20:23:35.6405721Z   persist-credentials: true
2025-05-07T20:23:35.6405978Z   clean: true
2025-05-07T20:23:35.6406211Z   sparse-checkout-cone-mode: true
2025-05-07T20:23:35.6406481Z   fetch-depth: 1
2025-05-07T20:23:35.6406698Z   fetch-tags: false
2025-05-07T20:23:35.6406916Z   show-progress: true
2025-05-07T20:23:35.6407141Z   lfs: false
2025-05-07T20:23:35.6407354Z   set-safe-directory: true
2025-05-07T20:23:35.6407607Z env:
2025-05-07T20:23:35.6407822Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:35.6408132Z   BUILD_ENV: build_binary
2025-05-07T20:23:35.6408394Z   BUILD_TARGET: genai
2025-05-07T20:23:35.6408663Z   BUILD_VARIANT: cuda
2025-05-07T20:23:35.6408938Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:35.6409191Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:35.6409439Z ##[endgroup]
2025-05-07T20:23:35.7567880Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:23:35.7569082Z ##[group]Getting Git version info
2025-05-07T20:23:35.7569567Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:23:35.7570248Z [command]/usr/bin/git version
2025-05-07T20:23:35.7570535Z git version 2.47.1
2025-05-07T20:23:35.7574515Z ##[endgroup]
2025-05-07T20:23:35.7588338Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/c149a0cb-4ab3-48f4-9c0f-4470b857a01b' before making global git config changes
2025-05-07T20:23:35.7589391Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:23:35.7602446Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:35.7643500Z [command]/usr/bin/git config --local --get remote.origin.url
2025-05-07T20:23:35.7668153Z https://github.com/pytorch/FBGEMM
2025-05-07T20:23:35.7686195Z ##[group]Removing previously created refs, to avoid conflicts
2025-05-07T20:23:35.7690956Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD
2025-05-07T20:23:35.7715824Z refs/heads/main
2025-05-07T20:23:35.7725246Z [command]/usr/bin/git checkout --detach
2025-05-07T20:23:36.6350669Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:36.6402955Z [command]/usr/bin/git branch --delete --force main
2025-05-07T20:23:36.6430738Z Deleted branch main (was b6b2ce3).
2025-05-07T20:23:36.6436254Z ##[endgroup]
2025-05-07T20:23:36.6440186Z [command]/usr/bin/git submodule status
2025-05-07T20:23:36.6862698Z e5d7c0bd5d9aec44d68830187138149e6a8c4e32 external/asmjit (e5d7c0b)
2025-05-07T20:23:36.6949609Z 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 external/composable_kernel (4a61bdd)
2025-05-07T20:23:36.7037535Z 6543fec09b2f04ac4a666882998b534afc9c1349 external/cpuinfo (6543fec)
2025-05-07T20:23:36.7123645Z 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 external/cutlass (3ed8d2e)
2025-05-07T20:23:36.7208909Z f8d7d77c06936315286eb55f8de22cd23c188571 external/googletest (f8d7d77)
2025-05-07T20:23:36.7294894Z 420084499c7c1e1c2d801922f40df202eac5f3a0 external/hipify_torch (4200844)
2025-05-07T20:23:36.7377423Z 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 external/json (9cca280)
2025-05-07T20:23:36.7391487Z ##[group]Cleaning the repository
2025-05-07T20:23:36.7395818Z [command]/usr/bin/git clean -ffdx
2025-05-07T20:23:36.7453762Z [command]/usr/bin/git reset --hard HEAD
2025-05-07T20:23:36.7563293Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:36.7570840Z ##[endgroup]
2025-05-07T20:23:36.7572893Z ##[group]Disabling automatic garbage collection
2025-05-07T20:23:36.7577426Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:23:36.7607512Z ##[endgroup]
2025-05-07T20:23:36.7607891Z ##[group]Setting up auth
2025-05-07T20:23:36.7624280Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:23:36.7654370Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:23:36.7986794Z Entering 'external/asmjit'
2025-05-07T20:23:36.8052849Z Entering 'external/composable_kernel'
2025-05-07T20:23:36.8126729Z Entering 'external/cpuinfo'
2025-05-07T20:23:36.8192091Z Entering 'external/cutlass'
2025-05-07T20:23:36.8266320Z Entering 'external/googletest'
2025-05-07T20:23:36.8331303Z Entering 'external/hipify_torch'
2025-05-07T20:23:36.8396211Z Entering 'external/json'
2025-05-07T20:23:36.8481727Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:23:36.8512598Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:23:36.8847187Z Entering 'external/asmjit'
2025-05-07T20:23:36.8913486Z Entering 'external/composable_kernel'
2025-05-07T20:23:36.8986074Z Entering 'external/cpuinfo'
2025-05-07T20:23:36.9051508Z Entering 'external/cutlass'
2025-05-07T20:23:36.9127020Z Entering 'external/googletest'
2025-05-07T20:23:36.9192433Z Entering 'external/hipify_torch'
2025-05-07T20:23:36.9257482Z Entering 'external/json'
2025-05-07T20:23:36.9344780Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:36.9396211Z ##[endgroup]
2025-05-07T20:23:36.9396605Z ##[group]Fetching the repository
2025-05-07T20:23:36.9403529Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:23:37.1124177Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:23:37.1124818Z * [new ref] a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:23:37.1150981Z ##[endgroup]
2025-05-07T20:23:37.1151545Z ##[group]Determining the checkout info
2025-05-07T20:23:37.1152737Z ##[endgroup]
2025-05-07T20:23:37.1157406Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:23:37.1208607Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:23:37.1238245Z ##[group]Checking out the ref
2025-05-07T20:23:37.1243039Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:23:37.1368003Z Previous HEAD position was b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:37.1372219Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:23:37.1382083Z ##[endgroup]
2025-05-07T20:23:37.1382676Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:23:37.1388852Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:37.1439505Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:23:37.1471627Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:23:37.1503712Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:23:37.1532549Z ##[endgroup]
2025-05-07T20:23:37.1533183Z ##[group]Fetching submodules
2025-05-07T20:23:37.1536700Z [command]/usr/bin/git submodule sync
2025-05-07T20:23:37.1910776Z Synchronizing submodule url for 'external/asmjit'
2025-05-07T20:23:37.1911242Z Synchronizing submodule url for 'external/composable_kernel'
2025-05-07T20:23:37.1911665Z Synchronizing submodule url for 'external/cpuinfo'
2025-05-07T20:23:37.1912046Z Synchronizing submodule url for 'external/cutlass'
2025-05-07T20:23:37.1913878Z Synchronizing submodule url for 'external/googletest'
2025-05-07T20:23:37.1914301Z Synchronizing submodule url for 'external/hipify_torch'
2025-05-07T20:23:37.1914692Z Synchronizing submodule url for 'external/json'
2025-05-07T20:23:37.1928456Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:23:37.2356145Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:23:37.2503947Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:23:37.2605698Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:23:37.2776233Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:23:37.2866514Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:23:37.2952329Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:23:37.3059433Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:23:37.3076898Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:23:37.3406582Z Entering 'external/asmjit'
2025-05-07T20:23:37.3438904Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.3472945Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.3505600Z Entering 'external/cutlass'
2025-05-07T20:23:37.3536947Z Entering 'external/googletest'
2025-05-07T20:23:37.3569114Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.3601914Z Entering 'external/json'
2025-05-07T20:23:37.3647364Z ##[endgroup]
2025-05-07T20:23:37.3647790Z ##[group]Persisting credentials for submodules
2025-05-07T20:23:37.3653391Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:23:37.3984081Z Entering 'external/asmjit'
2025-05-07T20:23:37.4025924Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4026695Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4069917Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.4112583Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4113020Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4161831Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.4204440Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4204877Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4247317Z Entering 'external/cutlass'
2025-05-07T20:23:37.4290507Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4291223Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4342517Z Entering 'external/googletest'
2025-05-07T20:23:37.4386316Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4386758Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4429472Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.4472593Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4473016Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4514921Z Entering 'external/json'
2025-05-07T20:23:37.4557856Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4558305Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4618662Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:23:37.4946867Z Entering 'external/asmjit'
2025-05-07T20:23:37.5008267Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:23:37.5011034Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.5072540Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:23:37.5075257Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.5136615Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:23:37.5139350Z Entering 'external/cutlass'
2025-05-07T20:23:37.5202189Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:23:37.5204868Z Entering 'external/googletest'
2025-05-07T20:23:37.5266367Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:23:37.5269713Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.5331130Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:23:37.5334018Z Entering 'external/json'
2025-05-07T20:23:37.5395227Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:23:37.5528364Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:23:37.5857003Z Entering 'external/asmjit'
2025-05-07T20:23:37.5889940Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.5922598Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.5954908Z Entering 'external/cutlass'
2025-05-07T20:23:37.5989235Z Entering 'external/googletest'
2025-05-07T20:23:37.6021011Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.6054080Z Entering 'external/json'
2025-05-07T20:23:37.6101382Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:23:37.6429716Z Entering 'external/asmjit'
2025-05-07T20:23:37.6462346Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.6498153Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.6529966Z Entering 'external/cutlass'
2025-05-07T20:23:37.6561999Z Entering 'external/googletest'
2025-05-07T20:23:37.6593820Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.6625471Z Entering 'external/json'
2025-05-07T20:23:37.6671510Z ##[endgroup]
2025-05-07T20:23:37.6713072Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:23:37.6740127Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:37.6918596Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:23:37.6918915Z with:
2025-05-07T20:23:37.6919159Z   name: fbgemm_genai_x86_gcc_py3.12_cu12.8.0.whl
2025-05-07T20:23:37.6919474Z   merge-multiple: false
2025-05-07T20:23:37.6919736Z   repository: pytorch/FBGEMM
2025-05-07T20:23:37.6919995Z   run-id: 14891846252
2025-05-07T20:23:37.6920229Z env:
2025-05-07T20:23:37.6920490Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:37.6920786Z   BUILD_ENV: build_binary
2025-05-07T20:23:37.6921032Z   BUILD_TARGET: genai
2025-05-07T20:23:37.6921255Z   BUILD_VARIANT: cuda
2025-05-07T20:23:37.6921491Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:37.6921741Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:37.6921984Z ##[endgroup]
2025-05-07T20:23:37.9255836Z Downloading single artifact
2025-05-07T20:23:38.0234489Z Preparing to download the following artifacts:
2025-05-07T20:23:38.0235316Z - fbgemm_genai_x86_gcc_py3.12_cu12.8.0.whl (ID: 3081407199, Size: 18498190, Expected Digest: sha256:44a8371d786eb18d4cfaf0c12983918cf9c0bfea6fa4b0e46e2bab9751f50039)
2025-05-07T20:23:38.1091147Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-ece139bb-06c0-5836-80f5-9819333cc7e6/artifacts/32c0b958496f27864187ef499761b3d1022dfdf4e072683d135f40e372c7bc42.zip
2025-05-07T20:23:38.1093394Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:38.1736986Z (node:65593) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:23:38.1737961Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:23:38.4408684Z SHA256 digest of downloaded artifact is 44a8371d786eb18d4cfaf0c12983918cf9c0bfea6fa4b0e46e2bab9751f50039
2025-05-07T20:23:38.4409269Z Artifact download completed successfully.
2025-05-07T20:23:38.4409638Z Total of 1 artifact(s) downloaded
2025-05-07T20:23:38.4414873Z Download artifact has finished successfully
2025-05-07T20:23:38.4669830Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:23:38.4670225Z with:
2025-05-07T20:23:38.4670453Z   driver-version: 570.133.07
2025-05-07T20:23:38.4670709Z env:
2025-05-07T20:23:38.4670938Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.4671240Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.4671494Z   BUILD_TARGET: genai
2025-05-07T20:23:38.4671729Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.4671965Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:38.4672227Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.4672465Z ##[endgroup]
2025-05-07T20:23:38.4766456Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:23:38.4766846Z with:
2025-05-07T20:23:38.4767054Z   timeout_minutes: 10
2025-05-07T20:23:38.4767405Z   max_attempts: 3
2025-05-07T20:23:38.4790492Z   command: # Is it disgusting to have a full shell script here in this github action? Sure
  # But is it the best way to make it so that this action relies on nothing else? Absolutely
  set -eou pipefail

  DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
  DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

  install_nvidia_docker2_amzn2() {
    (
      set -x
      # Needed for yum-config-manager
      sudo yum install -y yum-utils
      if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
        YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
      else
        # Amazon Linux 2
        YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
      fi
      sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
      sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
      sudo systemctl restart docker
    )
  }

  install_nvidia_docker2_ubuntu20() {
    (
      set -x
      # Install nvidia-driver package if not installed
      status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
      if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
        sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      fi
    )
  }

  pre_install_nvidia_driver_amzn2() {
    (
      # Purge any nvidia driver installed from RHEL repo
      sudo yum remove -y nvidia-driver-latest-dkms
    )
  }

  install_nvidia_driver_common() {
    (
      # Try to gather more information about the runner and its existing NVIDIA driver if any
      echo "Before installing NVIDIA driver"
      lspci
      lsmod
      modinfo nvidia || true

      HAS_NVIDIA_DRIVER=0
      # Check if NVIDIA driver has already been installed
      if [ -x "$(command -v nvidia-smi)" ]; then
        set +e
        # The driver exists, check its version next. Also check only the first GPU if there are more than one of them
        # so that the same driver version is not printed over multiple lines
        INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
        NVIDIA_SMI_STATUS=$?
        if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
          echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
        elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
          echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
          # Turn off persistent mode so that the installation script can unload the kernel module
          sudo killall nvidia-persistenced || true
        else
          HAS_NVIDIA_DRIVER=1
          echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
        fi
        set -e
      fi

      if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
        # CAUTION: this may need to be updated in future
        if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
          sudo yum groupinstall -y "Development Tools"
          # ensure our kernel install is the same as our underlying kernel,
          # groupinstall "Development Tools" has a habit of mismatching kernel headers
          sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
          sudo modprobe backlight
        fi
        sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

        set +e
        sudo /bin/bash /tmp/nvidia_driver -s --no-drm
        NVIDIA_INSTALLATION_STATUS=$?

        RESET_GPU=0
        if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
          sudo cat /var/log/nvidia-installer.log
          # Failed to install NVIDIA driver, try to reset the GPU
          RESET_GPU=1
        elif [ -x "$(command -v nvidia-smi)" ]; then
          # Check again if nvidia-smi works even if the driver installation completes successfully
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            RESET_GPU=1
          fi
        fi

        if [ "$RESET_GPU" -eq 1 ]; then
          NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
          # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
          # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
          for PCI_ID in $NVIDIA_DEVICES; do
            DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
            echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
            # This requires sudo permission of course
            echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
            sleep 1
          done
        fi

        sudo rm -fv /tmp/nvidia_driver
        set -e
      fi
    )
  }

  post_install_nvidia_driver_common() {
    (
      sudo modprobe nvidia || true
      echo "After installing NVIDIA driver"
      lspci
      lsmod
      modinfo nvidia || true
      (
        set +e
        nvidia-smi
        # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
        # the case where the driver has already crashed as it still can get the driver version
        # and some basic information like the bus ID. However, the rest of the information
        # would be missing (ERR!), for example:
        #
        # +-----------------------------------------------------------------------------+
        # | NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 |
        # |-------------------------------+----------------------+----------------------+
        # | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
        # | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
        # | | | MIG M. |
        # |===============================+======================+======================|
        # | 0 ERR! Off | 00000000:00:1E.0 Off | ERR! |
        # |ERR! ERR! ERR! ERR! / ERR! | 4184MiB / 23028MiB | ERR! Default |
        # | | | ERR! |
        # +-------------------------------+----------------------+----------------------+
        #
        # +-----------------------------------------------------------------------------+
        # | Processes: |
        # | GPU GI CI PID Type Process name GPU Memory |
        # | ID ID Usage |
        # |=============================================================================|
        # +-----------------------------------------------------------------------------+
        #
        # This should be reported as a failure instead as it is guaranteed to fail when
        # Docker tries to run with --gpus all
        #
        # So, the correct check here is to query one of the missing pieces of info like
        # GPU name, so that the command can fail accordingly
        nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
        NVIDIA_SMI_STATUS=$?

        # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
        if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
          echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
        else
          echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
          exit ${NVIDIA_SMI_STATUS}
        fi
        set -e
      )
    )
  }

  install_nvidia_driver_amzn2() {
    (
      set -x
      pre_install_nvidia_driver_amzn2
      install_nvidia_driver_common
      post_install_nvidia_driver_common
    )
  }

  install_nvidia_driver_ubuntu20() {
    (
      set -x
      install_nvidia_driver_common
      post_install_nvidia_driver_common
    )
  }

  echo "== Installing nvidia driver ${DRIVER_FN} =="
  case "${DISTRIBUTION}" in
    amzn*)
      install_nvidia_driver_amzn2
      ;;
    ubuntu20.04)
      install_nvidia_driver_ubuntu20
      ;;
    *)
      echo "ERROR: Unknown distribution ${DISTRIBUTION}"
      exit 1
      ;;
  esac

  # Install container toolkit based on distribution
  echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
  case "${DISTRIBUTION}" in
    amzn*)
      install_nvidia_docker2_amzn2
      ;;
    ubuntu20.04)
      install_nvidia_docker2_ubuntu20
      ;;
    *)
      echo "ERROR: Unknown distribution ${DISTRIBUTION}"
      exit 1
      ;;
  esac

  echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

  # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
  # more than one GPU. This just needs to be run once. The command fails
  # on subsequent runs and complains that the mode is already on, but that's
  # ok
  sudo nvidia-persistenced || true

  # This should show persistence mode ON
  nvidia-smi
2025-05-07T20:23:38.4814164Z   retry_wait_seconds: 10
2025-05-07T20:23:38.4814433Z   polling_interval_seconds: 1
2025-05-07T20:23:38.4814702Z   warning_on_retry: true
2025-05-07T20:23:38.4814960Z   continue_on_error: false
2025-05-07T20:23:38.4815207Z env:
2025-05-07T20:23:38.4815430Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.4815869Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.4816263Z   BUILD_TARGET: genai
2025-05-07T20:23:38.4816594Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.4833385Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:38.4833676Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.4833920Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:23:38.4834159Z ##[endgroup]
2025-05-07T20:23:38.5642850Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:23:38.5644389Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:23:38.5644788Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:23:38.9018056Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:23:38.9018963Z No packages marked for removal.
2025-05-07T20:23:38.9081705Z Dependencies resolved.
2025-05-07T20:23:38.9091414Z Nothing to do.
2025-05-07T20:23:38.9092018Z Complete!
2025-05-07T20:23:38.9929550Z + install_nvidia_driver_common
2025-05-07T20:23:38.9935783Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:23:38.9936425Z + lspci
2025-05-07T20:23:38.9937975Z Before installing NVIDIA driver
2025-05-07T20:23:39.0121503Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:39.0122374Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:39.0122927Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:39.0123447Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:39.0123913Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:39.0124531Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:39.0125078Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:39.0125549Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:39.0125947Z + lsmod
2025-05-07T20:23:39.0170874Z Module Size Used by
2025-05-07T20:23:39.0171254Z xt_nat 16384 0
2025-05-07T20:23:39.0171618Z nvidia_modeset 1716224 0
2025-05-07T20:23:39.0171922Z video 65536 1 nvidia_modeset
2025-05-07T20:23:39.0172236Z wmi 36864 1 video
2025-05-07T20:23:39.0172512Z nvidia_uvm 1884160 0
2025-05-07T20:23:39.0172807Z nvidia 11583488 2 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:39.0173163Z drm 602112 1 nvidia
2025-05-07T20:23:39.0173467Z drm_panel_orientation_quirks 32768 1 drm
2025-05-07T20:23:39.0173822Z backlight 24576 3 video,drm,nvidia_modeset
2025-05-07T20:23:39.0174170Z i2c_core 110592 2 nvidia,drm
2025-05-07T20:23:39.0174456Z veth 36864 0
2025-05-07T20:23:39.0174713Z xt_conntrack 16384 1
2025-05-07T20:23:39.0174967Z nft_chain_nat 16384 3
2025-05-07T20:23:39.0175225Z xt_MASQUERADE 20480 1
2025-05-07T20:23:39.0175534Z nf_nat 57344 3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:39.0175871Z nf_conntrack_netlink 57344 0
2025-05-07T20:23:39.0176509Z nf_conntrack 184320 5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:39.0176972Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:23:39.0177284Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:23:39.0177580Z xfrm_user 57344 1
2025-05-07T20:23:39.0177848Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:23:39.0178148Z xt_addrtype 16384 2
2025-05-07T20:23:39.0178402Z nft_compat 20480 4
2025-05-07T20:23:39.0178713Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:23:39.0179130Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:39.0179502Z br_netfilter 36864 0
2025-05-07T20:23:39.0179786Z bridge 323584 1 br_netfilter
2025-05-07T20:23:39.0180093Z stp 16384 1 bridge
2025-05-07T20:23:39.0180382Z llc 16384 2 bridge,stp
2025-05-07T20:23:39.0180675Z overlay 167936 0
2025-05-07T20:23:39.0180931Z tls 135168 0
2025-05-07T20:23:39.0181185Z nls_ascii 16384 1
2025-05-07T20:23:39.0181440Z nls_cp437 20480 1
2025-05-07T20:23:39.0181694Z vfat 24576 1
2025-05-07T20:23:39.0181946Z fat 86016 1 vfat
2025-05-07T20:23:39.0182213Z ena 180224 0
2025-05-07T20:23:39.0182465Z sunrpc 696320 1
2025-05-07T20:23:39.0182729Z ghash_clmulni_intel 16384 0
2025-05-07T20:23:39.0182993Z i8042 45056 0
2025-05-07T20:23:39.0183251Z serio 28672 3 i8042
2025-05-07T20:23:39.0183529Z button 24576 0
2025-05-07T20:23:39.0183782Z sch_fq_codel 20480 17
2025-05-07T20:23:39.0184042Z dm_mod 188416 0
2025-05-07T20:23:39.0184296Z dax 45056 1 dm_mod
2025-05-07T20:23:39.0184562Z loop 36864 0
2025-05-07T20:23:39.0184811Z fuse 163840 1
2025-05-07T20:23:39.0185156Z configfs 57344 1
2025-05-07T20:23:39.0185438Z dmi_sysfs 20480 0
2025-05-07T20:23:39.0185826Z crc32_pclmul 16384 0
2025-05-07T20:23:39.0186087Z crc32c_intel 24576 0
2025-05-07T20:23:39.0186343Z efivarfs 24576 1
2025-05-07T20:23:39.0186590Z + modinfo nvidia
2025-05-07T20:23:39.0190190Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:39.0190664Z import_ns: DMA_BUF
2025-05-07T20:23:39.0190918Z alias: char-major-195-*
2025-05-07T20:23:39.0191182Z version: 570.133.07
2025-05-07T20:23:39.0191429Z supported: external
2025-05-07T20:23:39.0191679Z license: Dual MIT/GPL
2025-05-07T20:23:39.0191960Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:39.0192325Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:39.0192767Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:23:39.0193111Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:39.0193459Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:39.0193805Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:39.0194123Z depends: i2c-core,drm
2025-05-07T20:23:39.0194373Z retpoline: Y
2025-05-07T20:23:39.0194594Z name: nvidia
2025-05-07T20:23:39.0194948Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:39.0195408Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:39.0195919Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:39.0196337Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:23:39.0196648Z parm: NVreg_RmLogonRC:int
2025-05-07T20:23:39.0196944Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:39.0197262Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:23:39.0197567Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:23:39.0197994Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:23:39.0198360Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:39.0198749Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:23:39.0199074Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:23:39.0199381Z parm: NVreg_EnableMSI:int
2025-05-07T20:23:39.0199693Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:39.0200049Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:39.0200445Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:39.0200826Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:39.0201236Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:39.0201637Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:23:39.0202055Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:39.0202469Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:23:39.0202803Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:39.0203175Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:39.0203542Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:39.0203872Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:23:39.0204193Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:39.0204528Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:39.0204849Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:39.0205152Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:23:39.0205504Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:39.0205867Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:23:39.0206192Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:23:39.0206525Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:39.0206868Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:39.0207197Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:23:39.0207539Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:39.0207966Z parm: NVreg_RmMsg:charp
2025-05-07T20:23:39.0208262Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:23:39.0208580Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:23:39.0208903Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:23:39.0209215Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:39.0209535Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:39.0209892Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:39.0210243Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:23:39.0210562Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:23:39.0210907Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:39.0211246Z parm: rm_firmware_active:charp
2025-05-07T20:23:39.0211544Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:23:39.0211782Z ++ command -v nvidia-smi
2025-05-07T20:23:39.0212046Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:23:39.0212306Z + set +e
2025-05-07T20:23:39.0212614Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:23:40.7191741Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:23:40.7192101Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:40.7192345Z + '[' 0 -ne 0 ']'
2025-05-07T20:23:40.7192567Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:23:40.7192831Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:23:40.7193270Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:23:40.7193740Z + set -e
2025-05-07T20:23:40.7193930Z + '[' 1 -eq 0 ']'
2025-05-07T20:23:40.7194314Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:23:40.7194774Z + post_install_nvidia_driver_common
2025-05-07T20:23:40.7197787Z + sudo modprobe nvidia
2025-05-07T20:23:40.8464003Z + echo 'After installing NVIDIA driver'
2025-05-07T20:23:40.8464559Z + lspci
2025-05-07T20:23:40.8464789Z After installing NVIDIA driver
2025-05-07T20:23:40.8585892Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:40.8586545Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:40.8587187Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:40.8587899Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:40.8588497Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:40.8589010Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:40.8589489Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:40.8589962Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:40.8590360Z + lsmod
2025-05-07T20:23:40.8618454Z Module Size Used by
2025-05-07T20:23:40.8618867Z xt_nat 16384 0
2025-05-07T20:23:40.8619272Z nvidia_modeset 1716224 0
2025-05-07T20:23:40.8619655Z video 65536 1 nvidia_modeset
2025-05-07T20:23:40.8620044Z wmi 36864 1 video
2025-05-07T20:23:40.8620315Z nvidia_uvm 1884160 0
2025-05-07T20:23:40.8620692Z nvidia 11583488 2 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:40.8621158Z drm 602112 1 nvidia
2025-05-07T20:23:40.8621567Z drm_panel_orientation_quirks 32768 1 drm
2025-05-07T20:23:40.8621974Z backlight 24576 3 video,drm,nvidia_modeset
2025-05-07T20:23:40.8622316Z i2c_core 110592 2 nvidia,drm
2025-05-07T20:23:40.8622601Z veth 36864 0
2025-05-07T20:23:40.8622857Z xt_conntrack 16384 1
2025-05-07T20:23:40.8623106Z nft_chain_nat 16384 3
2025-05-07T20:23:40.8623368Z xt_MASQUERADE 20480 1
2025-05-07T20:23:40.8623679Z nf_nat 57344 3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:40.8624019Z nf_conntrack_netlink 57344 0
2025-05-07T20:23:40.8624689Z nf_conntrack 184320 5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:40.8625151Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:23:40.8625461Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:23:40.8625747Z xfrm_user 57344 1
2025-05-07T20:23:40.8626017Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:23:40.8626302Z xt_addrtype 16384 2
2025-05-07T20:23:40.8626557Z nft_compat 20480 4
2025-05-07T20:23:40.8626858Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:23:40.8627264Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:40.8627632Z br_netfilter 36864 0
2025-05-07T20:23:40.8627910Z bridge 323584 1 br_netfilter
2025-05-07T20:23:40.8628203Z stp 16384 1 bridge
2025-05-07T20:23:40.8628480Z llc 16384 2 bridge,stp
2025-05-07T20:23:40.8628770Z overlay 167936 0
2025-05-07T20:23:40.8629026Z tls 135168 0
2025-05-07T20:23:40.8629275Z nls_ascii 16384 1
2025-05-07T20:23:40.8629519Z nls_cp437 20480 1
2025-05-07T20:23:40.8629764Z vfat 24576 1
2025-05-07T20:23:40.8630013Z fat 86016 1 vfat
2025-05-07T20:23:40.8630268Z ena 180224 0
2025-05-07T20:23:40.8630514Z sunrpc 696320 1
2025-05-07T20:23:40.8630768Z ghash_clmulni_intel 16384 0
2025-05-07T20:23:40.8631025Z i8042 45056 0
2025-05-07T20:23:40.8631302Z serio 28672 3 i8042
2025-05-07T20:23:40.8631598Z button 24576 0
2025-05-07T20:23:40.8631842Z sch_fq_codel 20480 17
2025-05-07T20:23:40.8632102Z dm_mod 188416 0
2025-05-07T20:23:40.8632358Z dax 45056 1 dm_mod
2025-05-07T20:23:40.8632622Z loop 36864 0
2025-05-07T20:23:40.8633009Z fuse 163840 1
2025-05-07T20:23:40.8633257Z configfs 57344 1
2025-05-07T20:23:40.8633516Z dmi_sysfs 20480 0
2025-05-07T20:23:40.8633761Z crc32_pclmul 16384 0
2025-05-07T20:23:40.8634014Z crc32c_intel 24576 0
2025-05-07T20:23:40.8634267Z efivarfs 24576 1
2025-05-07T20:23:40.8634514Z + modinfo nvidia
2025-05-07T20:23:40.8634987Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:40.8635638Z import_ns: DMA_BUF
2025-05-07T20:23:40.8636087Z alias: char-major-195-*
2025-05-07T20:23:40.8636443Z version: 570.133.07
2025-05-07T20:23:40.8636777Z supported: external
2025-05-07T20:23:40.8637096Z license: Dual MIT/GPL
2025-05-07T20:23:40.8637377Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:40.8637717Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:40.8638034Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:23:40.8638351Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:40.8638693Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:40.8639027Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:40.8639329Z depends: i2c-core,drm
2025-05-07T20:23:40.8639583Z retpoline: Y
2025-05-07T20:23:40.8639801Z name: nvidia
2025-05-07T20:23:40.8640157Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:40.8640766Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:40.8641384Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:40.8641862Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:23:40.8642163Z parm: NVreg_RmLogonRC:int
2025-05-07T20:23:40.8642462Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:40.8642778Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:23:40.8643073Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:23:40.8643382Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:23:40.8643862Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:40.8644249Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:23:40.8644597Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:23:40.8644899Z parm: NVreg_EnableMSI:int
2025-05-07T20:23:40.8645204Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:40.8645555Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:40.8645950Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:40.8646325Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:40.8646732Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:40.8647128Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:23:40.8647551Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:40.8647959Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:23:40.8648296Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:40.8648661Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:40.8649033Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:40.8649367Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:23:40.8649691Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:40.8650022Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:40.8650343Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:40.8650645Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:23:40.8650992Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:40.8651397Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:23:40.8651721Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:23:40.8652052Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:40.8652395Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:40.8652820Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:23:40.8653164Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:40.8653491Z parm: NVreg_RmMsg:charp
2025-05-07T20:23:40.8653770Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:23:40.8654094Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:23:40.8654414Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:23:40.8654726Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:40.8655048Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:40.8655403Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:40.8655755Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:23:40.8656071Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:23:40.8656418Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:40.8656756Z parm: rm_firmware_active:charp
2025-05-07T20:23:40.8657027Z + set +e
2025-05-07T20:23:40.8657225Z + nvidia-smi
2025-05-07T20:23:42.2878690Z Wed May 7 20:23:42 2025
2025-05-07T20:23:42.2879103Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.2879603Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
2025-05-07T20:23:42.2880098Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:42.2880589Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:42.2881124Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
2025-05-07T20:23:42.2881558Z | | | MIG M. |
2025-05-07T20:23:42.2881897Z |=========================================+========================+======================|
2025-05-07T20:23:42.2943169Z | 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
2025-05-07T20:23:42.2944151Z | 0% 30C P0 64W / 300W | 0MiB / 23028MiB | 4% Default |
2025-05-07T20:23:42.2944552Z | | | N/A |
2025-05-07T20:23:42.2944947Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:42.2945338Z
2025-05-07T20:23:42.2945727Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.2946154Z | Processes: |
2025-05-07T20:23:42.2946588Z | GPU GI CI PID Type Process name GPU Memory |
2025-05-07T20:23:42.2946995Z | ID ID Usage |
2025-05-07T20:23:42.2947344Z |=========================================================================================|
2025-05-07T20:23:42.2947788Z | No running processes found |
2025-05-07T20:23:42.2948258Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.7120168Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:44.1289468Z NVIDIA A10G
2025-05-07T20:23:44.4017755Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:44.4018135Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:44.4018469Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:44.4018867Z + set -e
2025-05-07T20:23:44.4019085Z INFO: Ignoring allowed status 0
2025-05-07T20:23:44.4026993Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:44.4041033Z + sudo yum install -y yum-utils
2025-05-07T20:23:44.8366420Z Last metadata expiration check: 0:17:42 ago on Wed May 7 20:06:02 2025.
2025-05-07T20:23:44.8613286Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:44.9007123Z Dependencies resolved.
2025-05-07T20:23:44.9187369Z Nothing to do.
2025-05-07T20:23:44.9187796Z Complete!
2025-05-07T20:23:44.9575545Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:44.9576130Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:44.9576977Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.3555187Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.4123057Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:45.9934133Z nvidia-container-toolkit 13 kB/s | 833 B 00:00
2025-05-07T20:23:46.0191256Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:46.0592176Z Dependencies resolved.
2025-05-07T20:23:46.0769656Z ================================================================================
2025-05-07T20:23:46.0770283Z Package Arch Version Repository Size
2025-05-07T20:23:46.0770809Z ================================================================================
2025-05-07T20:23:46.0771120Z Downgrading:
2025-05-07T20:23:46.0771527Z nvidia-container-toolkit x86_64 1.16.2-1 nvidia-container-toolkit 1.2 M
2025-05-07T20:23:46.0772397Z nvidia-container-toolkit-base x86_64 1.16.2-1 nvidia-container-toolkit 5.6 M
2025-05-07T20:23:46.0772784Z
2025-05-07T20:23:46.0772923Z Transaction Summary
2025-05-07T20:23:46.0773312Z ================================================================================
2025-05-07T20:23:46.0773694Z Downgrade 2 Packages
2025-05-07T20:23:46.0773843Z
2025-05-07T20:23:46.0773951Z Total download size: 6.8 M
2025-05-07T20:23:46.0774209Z Downloading Packages:
2025-05-07T20:23:46.1447463Z (1/2): nvidia-container-toolkit-base-1.16.2-1.x 85 MB/s | 5.6 MB 00:00
2025-05-07T20:23:46.1562366Z (2/2): nvidia-container-toolkit-1.16.2-1.x86_64 16 MB/s | 1.2 MB 00:00
2025-05-07T20:23:46.1570701Z --------------------------------------------------------------------------------
2025-05-07T20:23:46.1573526Z Total 86 MB/s | 6.8 MB 00:00
2025-05-07T20:23:46.1576014Z Running transaction check
2025-05-07T20:23:46.1678201Z Transaction check succeeded.
2025-05-07T20:23:46.1678742Z Running transaction test
2025-05-07T20:23:46.1973146Z Transaction test succeeded.
2025-05-07T20:23:46.1975188Z Running transaction
2025-05-07T20:23:46.7462541Z Preparing : 1/1
2025-05-07T20:23:46.8518509Z Downgrading : nvidia-container-toolkit-base-1.16.2-1.x86_64 1/4
2025-05-07T20:23:46.8551157Z Downgrading : nvidia-container-toolkit-1.16.2-1.x86_64 2/4
2025-05-07T20:23:46.8797971Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 2/4
2025-05-07T20:23:46.8798783Z Cleanup : nvidia-container-toolkit-1.17.6-1.x86_64 3/4
2025-05-07T20:23:46.8905095Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 3/4
2025-05-07T20:23:46.8930912Z Cleanup : nvidia-container-toolkit-base-1.17.6-1.x86_64 4/4
2025-05-07T20:23:47.0832886Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 4/4
2025-05-07T20:23:47.0833611Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 1/4
2025-05-07T20:23:47.0834145Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 2/4
2025-05-07T20:23:47.0834729Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 3/4
2025-05-07T20:23:47.2167817Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 4/4
================================================================================
2025-05-07T20:23:47.2168724Z WARNING:
2025-05-07T20:23:47.2169015Z A newer release of "Amazon Linux" is available.
2025-05-07T20:23:47.2169275Z
2025-05-07T20:23:47.2169377Z Available Versions:
2025-05-07T20:23:47.2169524Z
2025-05-07T20:23:47.2169626Z Version 2023.7.20250331:
2025-05-07T20:23:47.2169934Z Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:47.2170191Z
2025-05-07T20:23:47.2170315Z dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:47.2170524Z
2025-05-07T20:23:47.2170621Z Release notes:
2025-05-07T20:23:47.2171030Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:47.2171404Z
2025-05-07T20:23:47.2171495Z Version 2023.7.20250414:
2025-05-07T20:23:47.2171808Z Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:47.2172056Z
2025-05-07T20:23:47.2172174Z dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:47.2172382Z
2025-05-07T20:23:47.2172468Z Release notes:
2025-05-07T20:23:47.2172877Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:47.2173243Z
2025-05-07T20:23:47.2173342Z Version 2023.7.20250428:
2025-05-07T20:23:47.2173642Z Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:47.2173900Z
2025-05-07T20:23:47.2174017Z dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:47.2174231Z
2025-05-07T20:23:47.2174321Z Release notes:
2025-05-07T20:23:47.2174713Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:47.2175072Z
2025-05-07T20:23:47.2175186Z ================================================================================
2025-05-07T20:23:47.2532033Z
2025-05-07T20:23:47.2532170Z
2025-05-07T20:23:47.2532608Z Downgraded:
2025-05-07T20:23:47.2532975Z nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:47.2533555Z nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:47.2533918Z
2025-05-07T20:23:47.2534011Z Complete!
2025-05-07T20:23:47.2984417Z + sudo systemctl restart docker
2025-05-07T20:23:52.4205044Z Wed May 7 20:23:52 2025
2025-05-07T20:23:52.4205453Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:52.4205959Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
2025-05-07T20:23:52.4206454Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:52.4206954Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:52.4207496Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
2025-05-07T20:23:52.4207935Z | | | MIG M. |
2025-05-07T20:23:52.4208280Z |=========================================+========================+======================|
2025-05-07T20:23:52.4289891Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
2025-05-07T20:23:52.4290445Z | 0% 30C P0 64W / 300W | 0MiB / 23028MiB | 4% Default |
2025-05-07T20:23:52.4290829Z | | | N/A |
2025-05-07T20:23:52.4291227Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:52.4291685Z
2025-05-07T20:23:52.4292410Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:52.4292874Z | Processes: |
2025-05-07T20:23:52.4293320Z | GPU GI CI PID Type Process name GPU Memory |
2025-05-07T20:23:52.4294008Z | ID ID Usage |
2025-05-07T20:23:52.4294357Z |=========================================================================================|
2025-05-07T20:23:52.4295145Z | No running processes found |
2025-05-07T20:23:52.5749196Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:53.5402496Z Command completed after 1 attempt(s).
2025-05-07T20:23:53.5486726Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:53.5487182Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:53.5501512Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:53.5501866Z env:
2025-05-07T20:23:53.5502095Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:53.5502395Z   BUILD_ENV: build_binary
2025-05-07T20:23:53.5502641Z   BUILD_TARGET: genai
2025-05-07T20:23:53.5502887Z   BUILD_VARIANT: cuda
2025-05-07T20:23:53.5503121Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:53.5503383Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:53.5503684Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:53.5504012Z ##[endgroup]
2025-05-07T20:23:53.8853406Z ################################################################################
2025-05-07T20:23:53.8853776Z # Print System Info
2025-05-07T20:23:53.8853996Z #
2025-05-07T20:23:53.8869825Z # [2025-05-07T20:23:53.886Z] + print_system_info
2025-05-07T20:23:53.8870185Z ################################################################################
2025-05-07T20:23:53.8870399Z
2025-05-07T20:23:53.8870510Z ################################################################################
2025-05-07T20:23:53.8870840Z [INFO] Printing environment variables ...
2025-05-07T20:23:53.8871142Z + printenv
2025-05-07T20:23:53.8871256Z
2025-05-07T20:23:53.8895466Z SHELL=/bin/bash
2025-05-07T20:23:53.8895979Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:53.8896371Z BUILD_VARIANT=cuda
2025-05-07T20:23:53.8896929Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_2d420f38-63a6-49a3-894a-78ca8f969b19
2025-05-07T20:23:53.8897495Z GITHUB_ACTION=__run
2025-05-07T20:23:53.8897774Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:53.8898108Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:53.8898356Z RUNNER_NAME=i-0bb11f79b54aad6c7
2025-05-07T20:23:53.8898882Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:53.8899473Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:53.8899992Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:53.8900707Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:53.8901546Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:53.8902088Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:53.8902656Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:53.8903461Z ***
2025-05-07T20:23:53.8903850Z LOGNAME=ec2-user
2025-05-07T20:23:53.8904305Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:53.8904809Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:53.8905262Z GITHUB_ACTIONS=true
2025-05-07T20:23:53.8905700Z SYSTEMD_EXEC_PID=55408
2025-05-07T20:23:53.8906237Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:53.8907304Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:53.8908197Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:53.8908514Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:53.8908771Z RUNNER_OS=Linux
2025-05-07T20:23:53.8908995Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:53.8909239Z HOME=/home/ec2-user
2025-05-07T20:23:53.8909499Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:53.8909790Z LANG=C.UTF-8
2025-05-07T20:23:53.8910093Z RUNNER_TRACKING_ID=github_0bce55bd-12c2-4dec-a701-d9bdbd3e25ae
2025-05-07T20:23:53.8910453Z RUNNER_ARCH=X64
2025-05-07T20:23:53.8910736Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:53.8911345Z BUILD_TARGET=genai
2025-05-07T20:23:53.8911859Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_2d420f38-63a6-49a3-894a-78ca8f969b19
2025-05-07T20:23:53.8912716Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_2d420f38-63a6-49a3-894a-78ca8f969b19
2025-05-07T20:23:53.8913441Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:53.8914098Z INVOCATION_ID=fbf0150337c146ec88e18a11b2fcdd98
2025-05-07T20:23:53.8914417Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:53.8914684Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:53.8915254Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_2d420f38-63a6-49a3-894a-78ca8f969b19
2025-05-07T20:23:53.8915929Z BUILD_ENV=build_binary
2025-05-07T20:23:53.8916166Z GITHUB_ACTOR=q10
2025-05-07T20:23:53.8916383Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:53.8916616Z KERN_NAME_LC=linux
2025-05-07T20:23:53.8916841Z BUILD_CUDA_VERSION=12.8.0
2025-05-07T20:23:53.8917140Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:53.8917469Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:53.8917716Z USER=ec2-user
2025-05-07T20:23:53.8917948Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:53.8918219Z SHLVL=1 2025-05-07T20:23:53.8918415Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:53.8918723Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:53.8919167Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:53.8919528Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:53.8919766Z KERN_NAME=Linux 2025-05-07T20:23:53.8919995Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:53.8920391Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:53.8920818Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:53.8921093Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:53.8921329Z JOURNAL_STREAM=8:93345 2025-05-07T20:23:53.8921641Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:53.8922005Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:53.8922306Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:53.8922634Z GITHUB_BASE_REF=main 2025-05-07T20:23:53.8922855Z CI=true 2025-05-07T20:23:53.8923062Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:53.8923348Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:53.8923628Z GITHUB_ACTION_REF= 2025-05-07T20:23:53.8923883Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:53.8924477Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_2d420f38-63a6-49a3-894a-78ca8f969b19 2025-05-07T20:23:53.8925056Z MACHINE_NAME=x86_64 2025-05-07T20:23:53.8925281Z _=/usr/bin/printenv 2025-05-07T20:23:53.8925426Z 2025-05-07T20:23:53.8925540Z ################################################################################ 2025-05-07T20:23:53.8925862Z [INFO] Print ldd version ... 2025-05-07T20:23:53.8926129Z + ldd --version 2025-05-07T20:23:53.8926258Z 2025-05-07T20:23:53.8926349Z ldd (GNU libc) 2.34 2025-05-07T20:23:53.8926609Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:53.8927047Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:53.8927577Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:53.8928013Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:53.8928238Z 2025-05-07T20:23:53.8928351Z ################################################################################ 2025-05-07T20:23:53.8928686Z [INFO] Print CPU info ... 
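The nproc value printed below (16 on this g5.4xlarge runner) is the figure CI scripts typically feed into parallel build steps. A hypothetical sketch of that pattern; MAX_JOBS and the make invocation are illustrative and not taken from this workflow:

    MAX_JOBS="$(nproc)"           # 16 on this runner
    if [ "$MAX_JOBS" -gt 4 ]; then
      MAX_JOBS=$((MAX_JOBS - 2))  # assumption: leave headroom for the system
    fi
    make -j "$MAX_JOBS"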
2025-05-07T20:23:53.8928953Z + nproc 2025-05-07T20:23:53.8929060Z 2025-05-07T20:23:53.8945142Z 16 2025-05-07T20:23:53.8946647Z 2025-05-07T20:23:53.8946887Z + lscpu 2025-05-07T20:23:53.8946999Z 2025-05-07T20:23:53.9065052Z Architecture: x86_64 2025-05-07T20:23:53.9066184Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:53.9067408Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.9068168Z Byte Order: Little Endian 2025-05-07T20:23:53.9068483Z CPU(s): 16 2025-05-07T20:23:53.9068778Z On-line CPU(s) list: 0-15 2025-05-07T20:23:53.9069088Z Vendor ID: AuthenticAMD 2025-05-07T20:23:53.9069424Z Model name: AMD EPYC 7R32 2025-05-07T20:23:53.9069741Z CPU family: 23 2025-05-07T20:23:53.9070164Z Model: 49 2025-05-07T20:23:53.9070453Z Thread(s) per core: 2 2025-05-07T20:23:53.9070741Z Core(s) per socket: 8 2025-05-07T20:23:53.9071017Z Socket(s): 1 2025-05-07T20:23:53.9071295Z Stepping: 0 2025-05-07T20:23:53.9071594Z BogoMIPS: 5599.99 2025-05-07T20:23:53.9073686Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.9075864Z Hypervisor vendor: KVM 2025-05-07T20:23:53.9076175Z Virtualization type: full 2025-05-07T20:23:53.9076504Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:53.9076874Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:53.9077236Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:53.9077588Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:53.9077911Z NUMA node(s): 1 2025-05-07T20:23:53.9078203Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:53.9078532Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:53.9078947Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:53.9079301Z Vulnerability L1tf: Not affected 2025-05-07T20:23:53.9079654Z Vulnerability Mds: Not affected 2025-05-07T20:23:53.9080014Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:53.9080369Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:53.9080737Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:53.9081273Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:53.9081874Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:53.9082423Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:53.9083109Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:53.9083950Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:53.9084623Z Vulnerability Srbds: Not affected 2025-05-07T20:23:53.9084989Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:53.9085218Z 2025-05-07T20:23:53.9085410Z + cat /proc/cpuinfo 2025-05-07T20:23:53.9085547Z 2025-05-07T20:23:53.9085632Z processor : 0 2025-05-07T20:23:53.9085850Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.9086089Z cpu family : 23 2025-05-07T20:23:53.9086293Z model : 49 
2025-05-07T20:23:53.9086501Z model name : AMD EPYC 7R32
2025-05-07T20:23:53.9086749Z stepping : 0
2025-05-07T20:23:53.9086953Z microcode : 0x830107f
2025-05-07T20:23:53.9087283Z cpu MHz : 2530.842
2025-05-07T20:23:53.9087496Z cache size : 512 KB
2025-05-07T20:23:53.9087709Z physical id : 0
2025-05-07T20:23:53.9087920Z siblings : 16
2025-05-07T20:23:53.9088119Z core id : 0
2025-05-07T20:23:53.9088310Z cpu cores : 8
2025-05-07T20:23:53.9088512Z apicid : 0
2025-05-07T20:23:53.9088711Z initial apicid : 0
2025-05-07T20:23:53.9088917Z fpu : yes
2025-05-07T20:23:53.9089117Z fpu_exception : yes
2025-05-07T20:23:53.9089332Z cpuid level : 13
2025-05-07T20:23:53.9089532Z wp : yes
2025-05-07T20:23:53.9091608Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:23:53.9093870Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret
2025-05-07T20:23:53.9094359Z bogomips : 5599.99
2025-05-07T20:23:53.9094578Z TLB size : 3072 4K pages
2025-05-07T20:23:53.9094810Z clflush size : 64
2025-05-07T20:23:53.9095028Z cache_alignment : 64
2025-05-07T20:23:53.9095296Z address sizes : 48 bits physical, 48 bits virtual
2025-05-07T20:23:53.9095616Z power management:
[processors 1-15 omitted: identical to processor 0 except core id, apicid, and per-core cpu MHz (ranging from ~1797 to ~3319 MHz)]
2025-05-07T20:23:53.9303221Z 
2025-05-07T20:23:53.9303225Z 
2025-05-07T20:23:53.9303353Z ################################################################################
2025-05-07T20:23:53.9303664Z [INFO] Print PCI info ...
2025-05-07T20:23:53.9303909Z + lspci -v
2025-05-07T20:23:53.9304023Z 
2025-05-07T20:23:53.9304240Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:53.9304625Z Subsystem: Amazon.com, Inc.
Device 1237 2025-05-07T20:23:53.9304945Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:53.9305156Z 2025-05-07T20:23:53.9305357Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:53.9305736Z Physical Slot: 1 2025-05-07T20:23:53.9305976Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.9306175Z 2025-05-07T20:23:53.9306416Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:53.9306848Z Physical Slot: 1 2025-05-07T20:23:53.9307101Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:53.9307321Z 2025-05-07T20:23:53.9307587Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:53.9308021Z Physical Slot: 3 2025-05-07T20:23:53.9308256Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.9308605Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:53.9308953Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:53.9309178Z 2025-05-07T20:23:53.9309475Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:53.9310095Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:53.9310383Z Physical Slot: 4 2025-05-07T20:23:53.9310630Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:53.9311010Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:53.9311372Z Capabilities: 2025-05-07T20:23:53.9311632Z Kernel driver in use: nvme 2025-05-07T20:23:53.9311797Z 2025-05-07T20:23:53.9312097Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:53.9312572Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:53.9312907Z Physical Slot: 5 2025-05-07T20:23:53.9313146Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.9313497Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:53.9313876Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:53.9314193Z Capabilities: 2025-05-07T20:23:53.9314460Z Kernel driver in use: ena 2025-05-07T20:23:53.9314698Z Kernel modules: ena 2025-05-07T20:23:53.9314837Z 2025-05-07T20:23:53.9315001Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:53.9315380Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:53.9315722Z Physical Slot: 30 2025-05-07T20:23:53.9315978Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:53.9316352Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:53.9316746Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:53.9317111Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:53.9317438Z Capabilities: 2025-05-07T20:23:53.9317704Z Kernel driver in use: nvidia 2025-05-07T20:23:53.9317962Z Kernel modules: nvidia 2025-05-07T20:23:53.9318108Z 2025-05-07T20:23:53.9318405Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:53.9318915Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:53.9319203Z Physical Slot: 31 2025-05-07T20:23:53.9319440Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.9319792Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:53.9320173Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:53.9320493Z Capabilities: 2025-05-07T20:23:53.9320760Z Kernel driver in use: nvme 2025-05-07T20:23:53.9320921Z 2025-05-07T20:23:53.9320925Z 2025-05-07T20:23:53.9321043Z ################################################################################ 2025-05-07T20:23:53.9321377Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:53.9321652Z + uname -a 2025-05-07T20:23:53.9321769Z 2025-05-07T20:23:53.9322172Z Linux ip-10-0-16-208.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:53.9322675Z 2025-05-07T20:23:53.9322767Z + uname -m 2025-05-07T20:23:53.9322882Z 2025-05-07T20:23:53.9322961Z x86_64 2025-05-07T20:23:53.9323069Z 2025-05-07T20:23:53.9323154Z + cat /proc/version 2025-05-07T20:23:53.9323289Z 2025-05-07T20:23:53.9323826Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:53.9324448Z 2025-05-07T20:23:53.9324535Z + cat /etc/os-release 2025-05-07T20:23:53.9324675Z 2025-05-07T20:23:53.9324768Z NAME="Amazon Linux" 2025-05-07T20:23:53.9324981Z VERSION="2023" 2025-05-07T20:23:53.9325182Z ID="amzn" 2025-05-07T20:23:53.9325368Z ID_LIKE="fedora" 2025-05-07T20:23:53.9325569Z VERSION_ID="2023" 2025-05-07T20:23:53.9325803Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:53.9326079Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:53.9326362Z ANSI_COLOR="0;33" 2025-05-07T20:23:53.9326606Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:53.9327082Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:53.9327502Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:53.9327909Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:53.9328346Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:53.9328712Z VENDOR_NAME="AWS" 2025-05-07T20:23:53.9328941Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:53.9329225Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:53.9329377Z 2025-05-07T20:23:53.9329578Z ################################################################################ 2025-05-07T20:23:53.9329878Z # Print EC2 Instance Info 2025-05-07T20:23:53.9330111Z # 2025-05-07T20:23:53.9330319Z # [2025-05-07T20:23:53.928Z] + print_ec2_info 2025-05-07T20:23:53.9330626Z ################################################################################ 2025-05-07T20:23:53.9330839Z 2025-05-07T20:23:53.9408404Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:53.9525244Z instance-id: i-0bb11f79b54aad6c7 2025-05-07T20:23:53.9635432Z instance-type: g5.4xlarge 2025-05-07T20:23:53.9679735Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:53.9680117Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:53.9689803Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:53.9690165Z env: 2025-05-07T20:23:53.9690395Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:53.9690711Z BUILD_ENV: build_binary 2025-05-07T20:23:53.9690969Z BUILD_TARGET: genai 2025-05-07T20:23:53.9691212Z BUILD_VARIANT: cuda 2025-05-07T20:23:53.9691453Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:53.9691725Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:53.9692045Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:53.9692384Z ##[endgroup] 2025-05-07T20:23:54.3023105Z ################################################################################ 2025-05-07T20:23:54.3023506Z [INFO] Printing general display info ... 2025-05-07T20:23:54.3055235Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:54.4147227Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:54.4156103Z /usr/bin/sudo 2025-05-07T20:23:54.4167150Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:54.4178876Z /usr/bin/yum 2025-05-07T20:23:54.4180789Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:54.4202168Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:54.8444307Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:45 2025. 2025-05-07T20:23:54.9197583Z ================================================================================ 2025-05-07T20:23:54.9198040Z WARNING: 2025-05-07T20:23:54.9198418Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:54.9198739Z 2025-05-07T20:23:54.9198866Z Available Versions: 2025-05-07T20:23:54.9199061Z 2025-05-07T20:23:54.9199194Z Version 2023.7.20250331: 2025-05-07T20:23:54.9199529Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:54.9199810Z 2025-05-07T20:23:54.9199943Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:54.9200157Z 2025-05-07T20:23:54.9200246Z Release notes: 2025-05-07T20:23:54.9200647Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:54.9201008Z 2025-05-07T20:23:54.9201097Z Version 2023.7.20250414: 2025-05-07T20:23:54.9201401Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:54.9201645Z 2025-05-07T20:23:54.9201762Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:54.9201968Z 2025-05-07T20:23:54.9202058Z Release notes: 2025-05-07T20:23:54.9202439Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:54.9202804Z 2025-05-07T20:23:54.9202894Z Version 2023.7.20250428: 2025-05-07T20:23:54.9203193Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:54.9203670Z 2025-05-07T20:23:54.9203794Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:54.9204010Z 2025-05-07T20:23:54.9204099Z Release notes: 2025-05-07T20:23:54.9204484Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:54.9204839Z 2025-05-07T20:23:54.9204956Z ================================================================================ 2025-05-07T20:23:55.0340231Z Dependencies resolved. 
2025-05-07T20:23:55.0624777Z ================================================================================
2025-05-07T20:23:55.0625342Z Package Arch Version Repository Size
2025-05-07T20:23:55.0625845Z ================================================================================
2025-05-07T20:23:55.0626145Z Upgrading:
2025-05-07T20:23:55.0626502Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M
2025-05-07T20:23:55.0627088Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M
2025-05-07T20:23:55.0627512Z 
2025-05-07T20:23:55.0627846Z Transaction Summary
2025-05-07T20:23:55.0628204Z ================================================================================
2025-05-07T20:23:55.0628640Z Upgrade 2 Packages
2025-05-07T20:23:55.0628827Z 
2025-05-07T20:23:55.0628976Z Total download size: 6.9 M
2025-05-07T20:23:55.0629317Z Downloading Packages:
2025-05-07T20:23:55.1001347Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 34 MB/s | 1.2 MB 00:00
2025-05-07T20:23:55.1516452Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 65 MB/s | 5.7 MB 00:00
2025-05-07T20:23:55.1524727Z --------------------------------------------------------------------------------
2025-05-07T20:23:55.1527721Z Total 77 MB/s | 6.9 MB 00:00
2025-05-07T20:23:55.1530240Z Running transaction check
2025-05-07T20:23:55.1625736Z Transaction check succeeded.
2025-05-07T20:23:55.1626148Z Running transaction test
2025-05-07T20:23:55.1920532Z Transaction test succeeded.
2025-05-07T20:23:55.1923531Z Running transaction
2025-05-07T20:23:55.7435049Z Preparing : 1/1
2025-05-07T20:23:55.8494431Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4
2025-05-07T20:23:55.8520317Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4
2025-05-07T20:23:55.8741023Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4
2025-05-07T20:23:55.8741782Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4
2025-05-07T20:23:55.8853651Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4
2025-05-07T20:23:55.8883777Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4
2025-05-07T20:23:56.0477190Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4
2025-05-07T20:23:56.0477972Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4
2025-05-07T20:23:56.0478646Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4
2025-05-07T20:23:56.0479185Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4
2025-05-07T20:23:56.2516503Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4
2025-05-07T20:23:56.2516854Z 
2025-05-07T20:23:56.2516947Z Upgraded:
2025-05-07T20:23:56.2517289Z nvidia-container-toolkit-1.17.6-1.x86_64
2025-05-07T20:23:56.2517855Z nvidia-container-toolkit-base-1.17.6-1.x86_64
2025-05-07T20:23:56.2518198Z 
2025-05-07T20:23:56.2518281Z Complete!
2025-05-07T20:23:56.2958796Z [INSTALL] Installing system package(s): hostname lshw ...
2025-05-07T20:23:56.2983788Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw
2025-05-07T20:23:56.7477116Z Last metadata expiration check: 0:00:11 ago on Wed May 7 20:23:45 2025.
2025-05-07T20:23:56.7719035Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed.
2025-05-07T20:23:56.8123720Z Dependencies resolved.
2025-05-07T20:23:56.8302426Z ================================================================================
2025-05-07T20:23:56.8302931Z Package Architecture Version Repository Size
2025-05-07T20:23:56.8303426Z ================================================================================
2025-05-07T20:23:56.8303756Z Installing:
2025-05-07T20:23:56.8304050Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k
2025-05-07T20:23:56.8304322Z 
2025-05-07T20:23:56.8304419Z Transaction Summary
2025-05-07T20:23:56.8304664Z ================================================================================
2025-05-07T20:23:56.8304968Z Install 1 Package
2025-05-07T20:23:56.8305104Z 
2025-05-07T20:23:56.8305234Z Total download size: 319 k
2025-05-07T20:23:56.8306112Z Installed size: 837 k
2025-05-07T20:23:56.8307762Z Downloading Packages:
2025-05-07T20:23:56.9120120Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 6.0 MB/s | 319 kB 00:00
2025-05-07T20:23:56.9126365Z --------------------------------------------------------------------------------
2025-05-07T20:23:56.9129567Z Total 3.8 MB/s | 319 kB 00:00
2025-05-07T20:23:56.9283874Z Running transaction check
2025-05-07T20:23:56.9338580Z Transaction check succeeded.
2025-05-07T20:23:56.9339023Z Running transaction test
2025-05-07T20:23:56.9791919Z Transaction test succeeded.
2025-05-07T20:23:56.9795998Z Running transaction
2025-05-07T20:23:57.0792731Z Preparing : 1/1
2025-05-07T20:23:57.1277137Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:57.3055995Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:57.4574396Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:57.4574860Z 
2025-05-07T20:23:57.4574992Z Installed:
2025-05-07T20:23:57.4575422Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64
2025-05-07T20:23:57.4575830Z 
2025-05-07T20:23:57.4575964Z Complete!
2025-05-07T20:23:57.5032982Z + hostname
2025-05-07T20:23:57.5033148Z 
2025-05-07T20:23:57.5047487Z ip-10-0-16-208.ec2.internal
2025-05-07T20:23:57.5048843Z 
2025-05-07T20:23:57.5049381Z + sudo lshw -C display
2025-05-07T20:23:57.5049599Z 
2025-05-07T20:23:58.0775715Z *-display:0 UNCLAIMED
2025-05-07T20:23:58.0776118Z description: VGA compatible controller
2025-05-07T20:23:58.0776445Z product: Amazon.com, Inc.
2025-05-07T20:23:58.0776727Z vendor: Amazon.com, Inc.
2025-05-07T20:23:58.0776987Z physical id: 3 2025-05-07T20:23:58.0777226Z bus info: pci@0000:00:03.0 2025-05-07T20:23:58.0777479Z version: 00 2025-05-07T20:23:58.0777695Z width: 32 bits 2025-05-07T20:23:58.0777916Z clock: 33MHz 2025-05-07T20:23:58.0778159Z capabilities: vga_controller bus_master 2025-05-07T20:23:58.0778496Z configuration: latency=0 2025-05-07T20:23:58.0785978Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:58.0786331Z *-display:1 2025-05-07T20:23:58.0786583Z description: 3D controller 2025-05-07T20:23:58.0786877Z product: GA102GL [A10G] 2025-05-07T20:23:58.0787153Z vendor: NVIDIA Corporation 2025-05-07T20:23:58.0787421Z physical id: 1e 2025-05-07T20:23:58.0787667Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:58.0787932Z version: a1 2025-05-07T20:23:58.0788149Z width: 64 bits 2025-05-07T20:23:58.0788377Z clock: 33MHz 2025-05-07T20:23:58.0788676Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:58.0789051Z configuration: driver=nvidia latency=0 2025-05-07T20:23:58.0789672Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:58.0816111Z 2025-05-07T20:23:58.0816308Z ################################################################################ 2025-05-07T20:23:58.0816728Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:58.0950348Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:58.1120166Z Wed May 7 20:23:58 2025 2025-05-07T20:23:58.1120705Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:58.1121384Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:58.1121895Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:58.1122395Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:58.1122924Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:58.1123356Z | | | MIG M. | 2025-05-07T20:23:58.1123692Z |=========================================+========================+======================| 2025-05-07T20:23:58.1201478Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:58.1202341Z | 0% 30C P0 60W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:58.1202881Z | | | N/A | 2025-05-07T20:23:58.1203388Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:58.1203791Z 2025-05-07T20:23:58.1204180Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:58.1204603Z | Processes: | 2025-05-07T20:23:58.1205040Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:58.1205454Z | ID ID Usage | 2025-05-07T20:23:58.1205830Z |=========================================================================================| 2025-05-07T20:23:58.1206415Z | No running processes found | 2025-05-07T20:23:58.1207054Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:58.2594308Z ################################################################################ 2025-05-07T20:23:58.2594792Z [INFO] Printing AMD GPU info ... 
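The AMD GPU check that follows probes for ROCm tooling and merely reports when it is absent, as the "[CHECK] ... not found" lines show; on this CUDA runner neither tool exists. A minimal sketch of that probe pattern, assuming a plain command -v test (the real check in setup_env.bash may differ):

    for tool in rocminfo rocm-smi; do
      if command -v "$tool" >/dev/null 2>&1; then
        "$tool"                         # print ROCm GPU info when available
      else
        echo "[CHECK] $tool not found"  # degrade gracefully on CUDA runners
      fi
    done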
2025-05-07T20:23:58.2734229Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:58.2735191Z [CHECK] rocminfo not found 2025-05-07T20:23:58.2744077Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:58.2745210Z [CHECK] rocm-smi not found 2025-05-07T20:23:58.2780092Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:58.2780526Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:58.2793004Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:58.2793363Z env: 2025-05-07T20:23:58.2793593Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:58.2793907Z BUILD_ENV: build_binary 2025-05-07T20:23:58.2794164Z BUILD_TARGET: genai 2025-05-07T20:23:58.2794394Z BUILD_VARIANT: cuda 2025-05-07T20:23:58.2794645Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:58.2794910Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:58.2795218Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:58.2795558Z ##[endgroup] 2025-05-07T20:23:58.6133580Z ################################################################################ 2025-05-07T20:23:58.6133938Z # Setup Miniconda 2025-05-07T20:23:58.6134314Z # 2025-05-07T20:23:58.6148358Z # [2025-05-07T20:23:58.614Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:58.6148864Z ################################################################################ 2025-05-07T20:23:58.6149093Z 2025-05-07T20:23:58.6163208Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:58.7045094Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:58.7045462Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:58.7045660Z 2025-05-07T20:23:58.7062741Z 2025-05-07T20:23:58.7063081Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:58.7086238Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:59.6646494Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:59.6646876Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:59.6647131Z 2025-05-07T20:23:59.6789655Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:24:00.1189215Z Unpacking payload ... 2025-05-07T20:24:00.6380150Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:24:01.4470140Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:24:03.5509097Z 2025-05-07T20:24:03.5509447Z Installing base environment... 2025-05-07T20:24:03.5509678Z 2025-05-07T20:24:04.6384500Z Preparing transaction: ...working... done 2025-05-07T20:24:07.5491072Z Executing transaction: ...working... done 2025-05-07T20:24:08.2216355Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:24:08.3124739Z installation finished. 2025-05-07T20:24:08.3133799Z 2025-05-07T20:24:08.3133996Z + rm -f miniconda.sh 2025-05-07T20:24:08.3134186Z 2025-05-07T20:24:08.4006546Z 2025-05-07T20:24:08.4006946Z [SETUP] Reloading the bash configuration ... 
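Reloading the bash configuration is what makes the `conda` command usable later in this same non-interactive job step. The two commands the log runs next amount to the following, with paths matching the install prefix chosen above:

    # Register conda's shell hooks in ~/.bashrc, then re-source the rc file so
    # the current bash process picks them up without opening a new shell.
    /home/ec2-user/miniconda/bin/conda init bash
    . /home/ec2-user/.bashrc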
2025-05-07T20:24:08.4007314Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:24:08.7666875Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:24:08.7667255Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:24:08.7667648Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:24:08.7668157Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:24:08.7668522Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:24:08.7668913Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:24:08.7669342Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:24:08.7669780Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:24:08.7670235Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:24:08.7671009Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:24:08.7671531Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:24:08.7671911Z modified /home/ec2-user/.bashrc
2025-05-07T20:24:08.7672299Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:24:08.8318696Z + . /home/ec2-user/.bashrc
2025-05-07T20:24:09.6748815Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:24:09.6770599Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:24:23.0756111Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:24:24.6302584Z Solving environment: done
2025-05-07T20:24:24.7269807Z ## Package Plan ##
2025-05-07T20:24:24.7270559Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:24.7271223Z added / updated specs:
2025-05-07T20:24:24.7271741Z   - conda-libmamba-solver
2025-05-07T20:24:24.7272240Z   - libarchive
2025-05-07T20:24:24.7272640Z   - libmamba
2025-05-07T20:24:24.7273038Z   - libmambapy
2025-05-07T20:24:24.7273577Z The following packages will be downloaded:
2025-05-07T20:24:24.7274236Z package | build
2025-05-07T20:24:24.7274858Z ---------------------------|-----------------
2025-05-07T20:24:24.7275498Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge
2025-05-07T20:24:24.7276092Z certifi-2025.4.26 | pyhd8ed1ab_0 154 KB conda-forge
2025-05-07T20:24:24.7276514Z conda-25.3.1 | py313h78bf25f_1 1.1 MB conda-forge
2025-05-07T20:24:24.7276982Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0 41 KB conda-forge
2025-05-07T20:24:24.7277426Z ------------------------------------------------------------
2025-05-07T20:24:24.7277764Z Total: 1.4 MB
2025-05-07T20:24:24.7278082Z The following packages will be UPDATED:
2025-05-07T20:24:24.7283189Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:24.7283965Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:24:24.7284568Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:24.7285193Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:24:24.7285983Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:24:24.7286614Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:24.8043678Z certifi-2025.4.26 | 154 KB | ########## | 100%
2025-05-07T20:24:24.8142749Z conda-25.3.1 | 1.1 MB | ########## | 100%
2025-05-07T20:24:24.8229736Z conda-libmamba-solve | 41 KB | ########## | 100%
2025-05-07T20:24:24.8518077Z ca-certificates-2025 | 149 KB | ########## | 100%
2025-05-07T20:24:24.9512714Z done
2025-05-07T20:24:25.0518309Z Preparing transaction: done
2025-05-07T20:24:25.1524202Z Verifying transaction: done
2025-05-07T20:24:26.4542453Z Executing transaction: done
2025-05-07T20:24:28.1789617Z [SETUP] Updating Miniconda base packages ...
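The [EXEC] [ATTEMPT 0/3] prefix on each command indicates the script runs everything through a retry wrapper. A hedged sketch of such a wrapper; the helper's real name, attempt count, and backoff policy are not shown in this log, so everything below is an assumption:

    # Hypothetical retry helper: try a command up to 4 times (attempts 0..3).
    exec_with_retries () {
      local attempt
      for attempt in 0 1 2 3; do
        echo "[EXEC] [ATTEMPT ${attempt}/3] + $*"
        "$@" && return 0
        sleep 2   # assumed fixed delay; the actual policy is not logged
      done
      return 1
    }

    # Usage matching the step below:
    exec_with_retries conda update -n base -c defaults --update-deps -y conda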
2025-05-07T20:24:28.1815095Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:24:29.1425303Z Channels:
2025-05-07T20:24:29.1425738Z - defaults
2025-05-07T20:24:29.1426152Z Platform: linux-64
2025-05-07T20:24:30.3118224Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:30.4283935Z Solving environment: Channels:
2025-05-07T20:24:30.4284503Z - defaults
2025-05-07T20:24:30.4284915Z Platform: linux-64
2025-05-07T20:24:30.7223840Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:30.9328940Z Solving environment: done
2025-05-07T20:24:31.0147664Z done
2025-05-07T20:24:31.0801022Z ## Package Plan ##
2025-05-07T20:24:31.0801337Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:31.0801686Z added / updated specs:
2025-05-07T20:24:31.0801932Z   - conda
2025-05-07T20:24:31.0802176Z The following packages will be downloaded:
2025-05-07T20:24:31.0802508Z package | build
2025-05-07T20:24:31.0802823Z ---------------------------|-----------------
2025-05-07T20:24:31.0803169Z pip-25.1 | pyhc872135_2 1.3 MB
2025-05-07T20:24:31.0803800Z tzdata-2025b | h04d1e81_0 116 KB
2025-05-07T20:24:31.0804177Z ------------------------------------------------------------
2025-05-07T20:24:31.0804525Z Total: 1.4 MB
2025-05-07T20:24:31.0804853Z The following packages will be UPDATED:
2025-05-07T20:24:31.0805360Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:31.0805861Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:24:31.0806262Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:31.1521348Z tzdata-2025b | 116 KB | ########## | 100%
2025-05-07T20:24:31.3778749Z pip-25.1 | 1.3 MB | ########## | 100%
2025-05-07T20:24:31.3931865Z done
2025-05-07T20:24:31.4934907Z Preparing transaction: done
2025-05-07T20:24:31.5940359Z Verifying transaction: done
2025-05-07T20:24:33.5966999Z Executing transaction: done
2025-05-07T20:24:34.2026511Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:24:34.2030665Z + conda clean --packages --tarball -y
2025-05-07T20:24:35.2046703Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:24:35.2047181Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:24:35.2673441Z + conda clean --all -y
2025-05-07T20:24:35.8174798Z There are no unused tarball(s) to remove.
2025-05-07T20:24:35.8175160Z Will remove 1 index cache(s).
2025-05-07T20:24:35.8175444Z There are no unused package(s) to remove.
2025-05-07T20:24:35.8175765Z There are no tempfile(s) to remove. 2025-05-07T20:24:35.8176074Z There are no logfile(s) to remove. 2025-05-07T20:24:35.8818624Z 2025-05-07T20:24:35.8822322Z + conda info 2025-05-07T20:24:35.8822480Z 2025-05-07T20:24:36.6560339Z 2025-05-07T20:24:36.6560936Z active environment : base 2025-05-07T20:24:36.6561409Z active env location : /home/ec2-user/miniconda 2025-05-07T20:24:36.6561735Z shell level : 1 2025-05-07T20:24:36.6562013Z user config file : /home/ec2-user/.condarc 2025-05-07T20:24:36.6562400Z populated config files : /home/ec2-user/miniconda/.condarc 2025-05-07T20:24:36.6562767Z conda version : 25.3.1 2025-05-07T20:24:36.6563043Z conda-build version : not installed 2025-05-07T20:24:36.6563339Z python version : 3.13.2.final.0 2025-05-07T20:24:36.6563638Z solver : libmamba (default) 2025-05-07T20:24:36.6563946Z virtual packages : __archspec=1=zen2 2025-05-07T20:24:36.6564236Z __conda=25.3.1=0 2025-05-07T20:24:36.6564517Z __cuda=12.8=0 2025-05-07T20:24:36.6564799Z __glibc=2.34=0 2025-05-07T20:24:36.6565077Z __linux=6.1.130=0 2025-05-07T20:24:36.6565630Z __unix=0=0 2025-05-07T20:24:36.6566360Z base environment : /home/ec2-user/miniconda (writable) 2025-05-07T20:24:36.6566772Z conda av data dir : /home/ec2-user/miniconda/etc/conda 2025-05-07T20:24:36.6567114Z conda av metadata url : None 2025-05-07T20:24:36.6567488Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T20:24:36.6567916Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T20:24:36.6568293Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T20:24:36.6568669Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T20:24:36.6569038Z package cache : /home/ec2-user/miniconda/pkgs 2025-05-07T20:24:36.6569379Z /home/ec2-user/.conda/pkgs 2025-05-07T20:24:36.6569714Z envs directories : /home/ec2-user/miniconda/envs 2025-05-07T20:24:36.6570054Z /home/ec2-user/.conda/envs 2025-05-07T20:24:36.6570358Z platform : linux-64 2025-05-07T20:24:36.6571178Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/. 2025-05-07T20:24:36.6572150Z UID:GID : 1000:1000 2025-05-07T20:24:36.6572427Z netrc file : None 2025-05-07T20:24:36.6572687Z offline mode : False 2025-05-07T20:24:36.6572852Z 2025-05-07T20:24:36.7212038Z 2025-05-07T20:24:36.7212291Z [SETUP] Exporting Miniconda variables ... 2025-05-07T20:24:36.7213141Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_ce4d2be4-91d5-4eea-8431-0f6f6f174062 ... 2025-05-07T20:24:36.7214234Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda 2025-05-07T20:24:36.7295508Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.12 2025-05-07T20:24:36.7295996Z . 
$PRELUDE; create_conda_environment $BUILD_ENV 3.12 2025-05-07T20:24:36.7312611Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:24:36.7312962Z env: 2025-05-07T20:24:36.7313195Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:24:36.7313493Z BUILD_ENV: build_binary 2025-05-07T20:24:36.7313751Z BUILD_TARGET: genai 2025-05-07T20:24:36.7313979Z BUILD_VARIANT: cuda 2025-05-07T20:24:36.7314208Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:24:36.7314465Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:24:36.7314763Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:24:36.7315088Z ##[endgroup] 2025-05-07T20:24:37.0664985Z ################################################################################ 2025-05-07T20:24:37.0665642Z # Create Conda Environment 2025-05-07T20:24:37.0665910Z # 2025-05-07T20:24:37.0682319Z # [2025-05-07T20:24:37.067Z] + create_conda_environment build_binary 3.12 2025-05-07T20:24:37.0682736Z ################################################################################ 2025-05-07T20:24:37.0682952Z 2025-05-07T20:24:37.0697584Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:24:37.1646005Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:24:37.1646378Z [SETUP] Listing existing Conda environments ... 2025-05-07T20:24:37.1646721Z + conda info --envs 2025-05-07T20:24:37.1646859Z 2025-05-07T20:24:37.9313476Z 2025-05-07T20:24:37.9313839Z # conda environments: 2025-05-07T20:24:37.9314111Z # 2025-05-07T20:24:37.9314336Z base /home/ec2-user/miniconda 2025-05-07T20:24:37.9314569Z 2025-05-07T20:24:37.9987906Z 2025-05-07T20:24:37.9988512Z [SETUP] Deleting the prefix directory if it exists ... 2025-05-07T20:24:39.6381281Z + rm -rf /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:39.6381572Z 2025-05-07T20:24:39.6397267Z 2025-05-07T20:24:39.6406508Z [SETUP] Creating new Conda environment (Python 3.12) ... 
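Deleting the prefix directory before creating the environment guarantees a clean slate on a reused self-hosted runner; a stale env from a previous job would otherwise leak packages into this build. Reproduced standalone, the step below boils down to the following (paths as shown in the log):

    # Wipe any leftover environment, then create a fresh Python 3.12 env.
    rm -rf "$HOME/miniconda/envs/build_binary"
    conda create -y -n build_binary python=3.12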
2025-05-07T20:24:39.6428507Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.12
2025-05-07T20:24:40.4276251Z Channels:
2025-05-07T20:24:40.4276516Z - defaults
2025-05-07T20:24:40.4276727Z Platform: linux-64
2025-05-07T20:24:41.9311510Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:42.0554495Z Solving environment: done
2025-05-07T20:24:42.0843952Z ## Package Plan ##
2025-05-07T20:24:42.0844668Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:42.0845726Z added / updated specs:
2025-05-07T20:24:42.0846293Z   - python=3.12
2025-05-07T20:24:42.0846826Z The following packages will be downloaded:
2025-05-07T20:24:42.0847517Z package | build
2025-05-07T20:24:42.0848168Z ---------------------------|-----------------
2025-05-07T20:24:42.0848883Z _libgcc_mutex-0.1 | main 3 KB
2025-05-07T20:24:42.0849678Z _openmp_mutex-5.1 | 1_gnu 21 KB
2025-05-07T20:24:42.0850512Z ca-certificates-2025.2.25 | h06a4308_0 129 KB
2025-05-07T20:24:42.0851313Z python-3.12.9 | h5148396_0 34.7 MB
2025-05-07T20:24:42.0852432Z setuptools-78.1.1 | py312h06a4308_0 2.2 MB
2025-05-07T20:24:42.0853216Z wheel-0.45.1 | py312h06a4308_0 147 KB
2025-05-07T20:24:42.0853888Z ------------------------------------------------------------
2025-05-07T20:24:42.0854274Z Total: 37.2 MB
2025-05-07T20:24:42.0854622Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:42.0855237Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:42.0855690Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:42.0856110Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:42.0856595Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:42.0857085Z expat pkgs/main/linux-64::expat-2.7.1-h6a678d5_0
2025-05-07T20:24:42.0857540Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:42.0858007Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:42.0858439Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:42.0858878Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:42.0859334Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:42.0859800Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:42.0860225Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:42.0860651Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:42.0861058Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:42.0861464Z python pkgs/main/linux-64::python-3.12.9-h5148396_0
2025-05-07T20:24:42.0861897Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:42.0862372Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py312h06a4308_0
2025-05-07T20:24:42.0862845Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:42.0863240Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:42.0863630Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:42.0864048Z wheel pkgs/main/linux-64::wheel-0.45.1-py312h06a4308_0
2025-05-07T20:24:42.0864457Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:42.0864840Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:42.0865244Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:42.1250604Z wheel-0.45.1 | 147 KB | ########## | 100%
2025-05-07T20:24:42.1452549Z _openmp_mutex-5.1 | 21 KB | ########## | 100%
2025-05-07T20:24:42.1792938Z _libgcc_mutex-0.1 | 3 KB | ########## | 100%
2025-05-07T20:24:42.2079397Z ca-certificates-2025 | 129 KB | ########## | 100%
2025-05-07T20:24:42.2831366Z setuptools-78.1.1 | 2.2 MB | ########## | 100%
2025-05-07T20:24:42.6467921Z python-3.12.9 | 34.7 MB | ########## | 100%
2025-05-07T20:24:43.2308263Z done
2025-05-07T20:24:43.4415474Z Preparing transaction: done
2025-05-07T20:24:44.8785185Z Verifying transaction: done
2025-05-07T20:24:47.2994133Z Executing transaction: done
2025-05-07T20:24:47.3498448Z #
2025-05-07T20:24:47.3498782Z # To activate this environment, use
2025-05-07T20:24:47.3499182Z #
2025-05-07T20:24:47.3499458Z # $ conda activate build_binary
2025-05-07T20:24:47.3499741Z #
2025-05-07T20:24:47.3499959Z # To deactivate an active environment, use
2025-05-07T20:24:47.3500245Z #
2025-05-07T20:24:47.3500431Z # $ conda deactivate
2025-05-07T20:24:47.4579173Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:47.4603925Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:50.4370511Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (25.1)
2025-05-07T20:24:50.4371128Z Collecting pip
2025-05-07T20:24:50.4371440Z Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:50.4371867Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:50.4375748Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 64.5 MB/s eta 0:00:00
2025-05-07T20:24:50.4376155Z Installing collected packages: pip
2025-05-07T20:24:50.4376450Z Attempting uninstall: pip
2025-05-07T20:24:50.4376733Z Found existing installation: pip 25.1
2025-05-07T20:24:50.4377050Z Uninstalling pip-25.1:
2025-05-07T20:24:50.4377319Z Successfully uninstalled pip-25.1
2025-05-07T20:24:50.4377630Z Successfully installed pip-25.1.1
2025-05-07T20:24:50.4991832Z [SETUP] Upgrading pyOpenSSL ...
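Note the version spec in the next command: pyOpenSSL>22.1.0 contains a '>' that an interactive shell would parse as an output redirect, so it must be quoted when run by hand (inside the CI script the argument reaches conda as a single word; the quoted standalone form below is an assumption):

    # Quote the spec so '>' reaches conda instead of the shell.
    conda install -n build_binary -c conda-forge --override-channels -y "pyOpenSSL>22.1.0"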
2025-05-07T20:24:50.5014798Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:51.3781873Z Channels:
2025-05-07T20:24:51.3782185Z - conda-forge
2025-05-07T20:24:51.3782416Z Platform: linux-64
2025-05-07T20:25:01.7921259Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:03.4829960Z Solving environment: done
2025-05-07T20:25:03.5460539Z ## Package Plan ##
2025-05-07T20:25:03.5461035Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:03.5461609Z added / updated specs:
2025-05-07T20:25:03.5461884Z   - pyopenssl[version='>22.1.0']
2025-05-07T20:25:03.5462282Z The following packages will be downloaded:
2025-05-07T20:25:03.5462795Z package | build
2025-05-07T20:25:03.5463269Z ---------------------------|-----------------
2025-05-07T20:25:03.5463811Z cffi-1.17.1 | py312h06ac9bb_0 288 KB conda-forge
2025-05-07T20:25:03.5464341Z cryptography-44.0.3 | py312hda17c39_0 1.5 MB conda-forge
2025-05-07T20:25:03.5464967Z expat-2.7.0 | h5888daf_0 137 KB conda-forge
2025-05-07T20:25:03.5465779Z libexpat-2.7.0 | h5888daf_0 73 KB conda-forge
2025-05-07T20:25:03.5466211Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge
2025-05-07T20:25:03.5466622Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge
2025-05-07T20:25:03.5467043Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge
2025-05-07T20:25:03.5467452Z libnsl-2.0.1 | hd590300_0 33 KB conda-forge
2025-05-07T20:25:03.5467882Z libsqlite-3.46.0 | hde9e2c9_0 845 KB conda-forge
2025-05-07T20:25:03.5468301Z libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge
2025-05-07T20:25:03.5468723Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge
2025-05-07T20:25:03.5469146Z libzlib-1.2.13 | h4ab18f5_6 60 KB conda-forge
2025-05-07T20:25:03.5469554Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge
2025-05-07T20:25:03.5470247Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge
2025-05-07T20:25:03.5470687Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge
2025-05-07T20:25:03.5471132Z python-3.12.2 |hab00c5b_0_cpython 30.8 MB conda-forge
2025-05-07T20:25:03.5471555Z python_abi-3.12 | 7_cp312 7 KB conda-forge
2025-05-07T20:25:03.5472017Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge
2025-05-07T20:25:03.5472827Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge
2025-05-07T20:25:03.5473274Z zlib-1.2.13 | h4ab18f5_6 91 KB conda-forge
2025-05-07T20:25:03.5473656Z ------------------------------------------------------------
2025-05-07T20:25:03.5474003Z Total: 38.6 MB
2025-05-07T20:25:03.5474360Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:03.5474775Z cffi conda-forge/linux-64::cffi-1.17.1-py312h06ac9bb_0
2025-05-07T20:25:03.5475275Z cryptography conda-forge/linux-64::cryptography-44.0.3-py312hda17c39_0
2025-05-07T20:25:03.5475867Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0
2025-05-07T20:25:03.5476311Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:25:03.5476732Z libnsl conda-forge/linux-64::libnsl-2.0.1-hd590300_0
2025-05-07T20:25:03.5478874Z libsqlite conda-forge/linux-64::libsqlite-3.46.0-hde9e2c9_0
2025-05-07T20:25:03.5479351Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:25:03.5479802Z libzlib conda-forge/linux-64::libzlib-1.2.13-h4ab18f5_6
2025-05-07T20:25:03.5480265Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:25:03.5480819Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:25:03.5481412Z python_abi conda-forge/noarch::python_abi-3.12-7_cp312
2025-05-07T20:25:03.5481925Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:25:03.5482508Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:25:03.5482970Z The following packages will be UPDATED:
2025-05-07T20:25:03.5483575Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:25:03.5484334Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:25:03.5484975Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:25:03.5485599Z libuuid pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0
2025-05-07T20:25:03.5486281Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:25:03.5486941Z zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.2.13-h4ab18f5_6
2025-05-07T20:25:03.5487485Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:25:03.5488043Z expat pkgs/main::expat-2.7.1-h6a678d5_0 --> conda-forge::expat-2.7.0-h5888daf_0
2025-05-07T20:25:03.5488668Z python pkgs/main::python-3.12.9-h5148396_0 --> conda-forge::python-3.12.2-hab00c5b_0_cpython
2025-05-07T20:25:03.5489205Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:03.6396870Z libsqlite-3.46.0 | 845 KB | ########## | 100%
2025-05-07T20:25:03.6492491Z openssl-3.5.0 | 3.0 MB | ########## | 100%
2025-05-07T20:25:03.6683228Z libgcc-15.1.0 | 810 KB | ########## | 100%
2025-05-07T20:25:03.6998554Z libgomp-15.1.0 | 442 KB | ########## | 100%
2025-05-07T20:25:03.7419589Z cffi-1.17.1 | 288 KB | ########## | 100%
2025-05-07T20:25:03.7602113Z pyopenssl-25.0.0 | 120 KB | ########## | 100%
2025-05-07T20:25:03.7629892Z expat-2.7.0 | 137 KB | ########## | 100%
2025-05-07T20:25:03.7684695Z pycparser-2.22 | 108 KB | ########## | 100%
2025-05-07T20:25:03.8088886Z typing-extensions-4. | 88 KB | ########## | 100%
2025-05-07T20:25:03.8129037Z cryptography-44.0.3 | 1.5 MB | ########## | 100%
2025-05-07T20:25:03.8218506Z libxcrypt-4.4.36 | 98 KB | ########## | 100%
2025-05-07T20:25:03.8367486Z zlib-1.2.13 | 91 KB | ########## | 100%
2025-05-07T20:25:03.8684959Z libexpat-2.7.0 | 73 KB | ########## | 100%
2025-05-07T20:25:03.8841048Z typing_extensions-4. | 51 KB | ########## | 100%
2025-05-07T20:25:03.8982199Z libgcc-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:25:03.9074397Z libzlib-1.2.13 | 60 KB | ########## | 100%
2025-05-07T20:25:03.9147383Z libuuid-2.38.1 | 33 KB | ########## | 100%
2025-05-07T20:25:03.9351297Z libnsl-2.0.1 | 33 KB | ########## | 100%
2025-05-07T20:25:03.9686350Z ... (more hidden) ...
2025-05-07T20:25:04.2808695Z python-3.12.2 | 30.8 MB | ########1 | 81%
2025-05-07T20:25:04.3579115Z libuuid-2.38.1 | 33 KB | ########## | 100%  2025-05-07T20:25:04.3579497Z 2025-05-07T20:25:04.3579501Z 2025-05-07T20:25:04.3579505Z 2025-05-07T20:25:04.3579508Z 2025-05-07T20:25:04.3579512Z 2025-05-07T20:25:04.3579515Z 2025-05-07T20:25:04.3579519Z 2025-05-07T20:25:04.3579522Z 2025-05-07T20:25:04.3579535Z 2025-05-07T20:25:04.3579538Z 2025-05-07T20:25:04.3579542Z 2025-05-07T20:25:04.3579545Z 2025-05-07T20:25:04.3579549Z 2025-05-07T20:25:04.3579552Z 2025-05-07T20:25:04.3579556Z 2025-05-07T20:25:04.3579559Z 2025-05-07T20:25:04.3580022Z 2025-05-07T20:25:04.3692742Z libuuid-2.38.1 | 33 KB | ########## | 100%  2025-05-07T20:25:04.3863599Z python-3.12.2 | 30.8 MB | #########6 | 96% 2025-05-07T20:25:04.3863950Z 2025-05-07T20:25:04.3863954Z 2025-05-07T20:25:04.3863958Z 2025-05-07T20:25:04.3863971Z 2025-05-07T20:25:04.3863975Z 2025-05-07T20:25:04.3863979Z 2025-05-07T20:25:04.3863982Z 2025-05-07T20:25:04.3863986Z 2025-05-07T20:25:04.3863990Z 2025-05-07T20:25:04.3863993Z 2025-05-07T20:25:04.3863997Z 2025-05-07T20:25:04.3864000Z 2025-05-07T20:25:04.3864004Z 2025-05-07T20:25:04.3864015Z 2025-05-07T20:25:04.3864019Z 2025-05-07T20:25:04.3864022Z 2025-05-07T20:25:04.3864026Z 2025-05-07T20:25:04.3864029Z 2025-05-07T20:25:04.3864245Z 2025-05-07T20:25:04.3913098Z ... (more hidden) ... 2025-05-07T20:25:04.3913527Z 2025-05-07T20:25:04.3913533Z 2025-05-07T20:25:04.3913538Z 2025-05-07T20:25:04.3913543Z 2025-05-07T20:25:04.3913548Z 2025-05-07T20:25:04.3913553Z 2025-05-07T20:25:04.3913558Z 2025-05-07T20:25:04.3913563Z 2025-05-07T20:25:04.3913568Z 2025-05-07T20:25:04.3913574Z 2025-05-07T20:25:04.3913579Z 2025-05-07T20:25:04.3913584Z 2025-05-07T20:25:04.3913589Z 2025-05-07T20:25:04.3913594Z 2025-05-07T20:25:04.3913599Z 2025-05-07T20:25:04.3913875Z 2025-05-07T20:25:04.3913883Z 2025-05-07T20:25:04.3913889Z 2025-05-07T20:25:04.3918401Z libnsl-2.0.1 | 33 KB | ########## | 100%  2025-05-07T20:25:04.3918866Z 2025-05-07T20:25:04.3918873Z 2025-05-07T20:25:04.3918879Z 2025-05-07T20:25:04.3918885Z 2025-05-07T20:25:04.3918891Z 2025-05-07T20:25:04.3918896Z 2025-05-07T20:25:04.3918911Z 2025-05-07T20:25:04.3918931Z 2025-05-07T20:25:04.3918936Z 2025-05-07T20:25:04.3918942Z 2025-05-07T20:25:04.3918947Z 2025-05-07T20:25:04.3918953Z 2025-05-07T20:25:04.3918957Z 2025-05-07T20:25:04.3918963Z 2025-05-07T20:25:04.3918968Z 2025-05-07T20:25:04.3918973Z 2025-05-07T20:25:04.3918979Z 2025-05-07T20:25:04.3918984Z 2025-05-07T20:25:04.4474347Z libnsl-2.0.1 | 33 KB | ########## | 100%  2025-05-07T20:25:04.4474667Z 2025-05-07T20:25:04.4474671Z 2025-05-07T20:25:04.4487424Z cryptography-44.0.3 | 1.5 MB | ########## | 100%  2025-05-07T20:25:05.1391500Z python-3.12.2 | 30.8 MB | ########## | 100% 2025-05-07T20:25:05.1398065Z python-3.12.2 | 30.8 MB | ########## | 100% 2025-05-07T20:25:05.1398301Z 2025-05-07T20:25:05.1398399Z 2025-05-07T20:25:05.1398404Z 2025-05-07T20:25:05.1398426Z 2025-05-07T20:25:05.1398431Z 2025-05-07T20:25:05.1398434Z 2025-05-07T20:25:05.1398439Z 2025-05-07T20:25:05.1398442Z 2025-05-07T20:25:05.1398464Z 2025-05-07T20:25:05.1398470Z 2025-05-07T20:25:05.1398474Z 2025-05-07T20:25:05.1398478Z 2025-05-07T20:25:05.1398484Z 2025-05-07T20:25:05.1398489Z 2025-05-07T20:25:05.1398495Z 2025-05-07T20:25:05.1398500Z 2025-05-07T20:25:05.1398506Z 2025-05-07T20:25:05.1398510Z 2025-05-07T20:25:05.1398519Z 2025-05-07T20:25:05.1398643Z 2025-05-07T20:25:05.1399054Z  2025-05-07T20:25:05.1399381Z 2025-05-07T20:25:05.1399579Z 2025-05-07T20:25:05.1399758Z  2025-05-07T20:25:05.1399961Z 2025-05-07T20:25:05.1399965Z 
2025-05-07T20:25:05.1400130Z  2025-05-07T20:25:05.1400355Z 2025-05-07T20:25:05.1400361Z 2025-05-07T20:25:05.1400366Z 2025-05-07T20:25:05.1400625Z  2025-05-07T20:25:05.1400966Z 2025-05-07T20:25:05.1400971Z 2025-05-07T20:25:05.1400976Z 2025-05-07T20:25:05.1400981Z 2025-05-07T20:25:05.1401219Z  2025-05-07T20:25:05.1401436Z 2025-05-07T20:25:05.1401440Z 2025-05-07T20:25:05.1401443Z 2025-05-07T20:25:05.1401446Z 2025-05-07T20:25:05.1401450Z 2025-05-07T20:25:05.1401624Z  2025-05-07T20:25:05.1401839Z 2025-05-07T20:25:05.1401843Z 2025-05-07T20:25:05.1401846Z 2025-05-07T20:25:05.1401850Z 2025-05-07T20:25:05.1401853Z 2025-05-07T20:25:05.1401863Z 2025-05-07T20:25:05.1402040Z  2025-05-07T20:25:05.1402263Z 2025-05-07T20:25:05.1402266Z 2025-05-07T20:25:05.1402270Z 2025-05-07T20:25:05.1402273Z 2025-05-07T20:25:05.1402277Z 2025-05-07T20:25:05.1402280Z 2025-05-07T20:25:05.1402283Z 2025-05-07T20:25:05.1402463Z  2025-05-07T20:25:05.1403003Z 2025-05-07T20:25:05.1403008Z 2025-05-07T20:25:05.1403014Z 2025-05-07T20:25:05.1403019Z 2025-05-07T20:25:05.1403025Z 2025-05-07T20:25:05.1403030Z 2025-05-07T20:25:05.1403036Z 2025-05-07T20:25:05.1403041Z 2025-05-07T20:25:05.1403294Z  2025-05-07T20:25:05.1403597Z 2025-05-07T20:25:05.1403601Z 2025-05-07T20:25:05.1403605Z 2025-05-07T20:25:05.1403608Z 2025-05-07T20:25:05.1403612Z 2025-05-07T20:25:05.1403615Z 2025-05-07T20:25:05.1403619Z 2025-05-07T20:25:05.1403786Z 2025-05-07T20:25:05.1403791Z 2025-05-07T20:25:05.1403994Z  2025-05-07T20:25:05.1404211Z 2025-05-07T20:25:05.1404215Z 2025-05-07T20:25:05.1404218Z 2025-05-07T20:25:05.1404222Z 2025-05-07T20:25:05.1404225Z 2025-05-07T20:25:05.1404228Z 2025-05-07T20:25:05.1404232Z 2025-05-07T20:25:05.1404235Z 2025-05-07T20:25:05.1404239Z 2025-05-07T20:25:05.1404250Z 2025-05-07T20:25:05.1404446Z  2025-05-07T20:25:05.1404665Z 2025-05-07T20:25:05.1404669Z 2025-05-07T20:25:05.1404673Z 2025-05-07T20:25:05.1404676Z 2025-05-07T20:25:05.1404680Z 2025-05-07T20:25:05.1404683Z 2025-05-07T20:25:05.1404687Z 2025-05-07T20:25:05.1404690Z 2025-05-07T20:25:05.1404701Z 2025-05-07T20:25:05.1404705Z 2025-05-07T20:25:05.1404708Z 2025-05-07T20:25:05.1404899Z  2025-05-07T20:25:05.1405126Z 2025-05-07T20:25:05.1405129Z 2025-05-07T20:25:05.1405133Z 2025-05-07T20:25:05.1405142Z 2025-05-07T20:25:05.1405146Z 2025-05-07T20:25:05.1405149Z 2025-05-07T20:25:05.1405153Z 2025-05-07T20:25:05.1405156Z 2025-05-07T20:25:05.1405160Z 2025-05-07T20:25:05.1405163Z 2025-05-07T20:25:05.1405167Z 2025-05-07T20:25:05.1405170Z 2025-05-07T20:25:05.1405362Z  2025-05-07T20:25:05.1405595Z 2025-05-07T20:25:05.1405599Z 2025-05-07T20:25:05.1405602Z 2025-05-07T20:25:05.1405606Z 2025-05-07T20:25:05.1405609Z 2025-05-07T20:25:05.1405613Z 2025-05-07T20:25:05.1405616Z 2025-05-07T20:25:05.1405620Z 2025-05-07T20:25:05.1405623Z 2025-05-07T20:25:05.1405627Z 2025-05-07T20:25:05.1405630Z 2025-05-07T20:25:05.1405634Z 2025-05-07T20:25:05.1405637Z 2025-05-07T20:25:05.1405833Z  2025-05-07T20:25:05.1406071Z 2025-05-07T20:25:05.1406081Z 2025-05-07T20:25:05.1406086Z 2025-05-07T20:25:05.1406090Z 2025-05-07T20:25:05.1406094Z 2025-05-07T20:25:05.1406099Z 2025-05-07T20:25:05.1406103Z 2025-05-07T20:25:05.1406108Z 2025-05-07T20:25:05.1406112Z 2025-05-07T20:25:05.1406117Z 2025-05-07T20:25:05.1406121Z 2025-05-07T20:25:05.1406125Z 2025-05-07T20:25:05.1406130Z 2025-05-07T20:25:05.1406134Z 2025-05-07T20:25:05.1406373Z  2025-05-07T20:25:05.1406605Z 2025-05-07T20:25:05.1406609Z 2025-05-07T20:25:05.1406612Z 2025-05-07T20:25:05.1406616Z 2025-05-07T20:25:05.1406619Z 2025-05-07T20:25:05.1406623Z 2025-05-07T20:25:05.1406627Z 
2025-05-07T20:25:05.1406635Z 2025-05-07T20:25:05.1406639Z 2025-05-07T20:25:05.1406642Z 2025-05-07T20:25:05.1406646Z 2025-05-07T20:25:05.1406649Z 2025-05-07T20:25:05.1406653Z 2025-05-07T20:25:05.1406656Z 2025-05-07T20:25:05.1406660Z 2025-05-07T20:25:05.1406874Z  2025-05-07T20:25:05.1407108Z 2025-05-07T20:25:05.1407111Z 2025-05-07T20:25:05.1407115Z 2025-05-07T20:25:05.1407118Z 2025-05-07T20:25:05.1407122Z 2025-05-07T20:25:05.1407125Z 2025-05-07T20:25:05.1407129Z 2025-05-07T20:25:05.1407132Z 2025-05-07T20:25:05.1407136Z 2025-05-07T20:25:05.1407139Z 2025-05-07T20:25:05.1407143Z 2025-05-07T20:25:05.1407146Z 2025-05-07T20:25:05.1407150Z 2025-05-07T20:25:05.1407234Z 2025-05-07T20:25:05.1407237Z 2025-05-07T20:25:05.1407241Z 2025-05-07T20:25:05.1407457Z  2025-05-07T20:25:05.1407690Z 2025-05-07T20:25:05.1407694Z 2025-05-07T20:25:05.1407697Z 2025-05-07T20:25:05.1407701Z 2025-05-07T20:25:05.1407705Z 2025-05-07T20:25:05.1407708Z 2025-05-07T20:25:05.1407712Z 2025-05-07T20:25:05.1407715Z 2025-05-07T20:25:05.1407719Z 2025-05-07T20:25:05.1407722Z 2025-05-07T20:25:05.1407726Z 2025-05-07T20:25:05.1407729Z 2025-05-07T20:25:05.1407817Z 2025-05-07T20:25:05.1407828Z 2025-05-07T20:25:05.1407832Z 2025-05-07T20:25:05.1407835Z 2025-05-07T20:25:05.1407839Z 2025-05-07T20:25:05.1408062Z  2025-05-07T20:25:05.1408298Z 2025-05-07T20:25:05.1408302Z 2025-05-07T20:25:05.1408305Z 2025-05-07T20:25:05.1408309Z 2025-05-07T20:25:05.1408312Z 2025-05-07T20:25:05.1408322Z 2025-05-07T20:25:05.1408325Z 2025-05-07T20:25:05.1408329Z 2025-05-07T20:25:05.1408332Z 2025-05-07T20:25:05.1408336Z 2025-05-07T20:25:05.1408340Z 2025-05-07T20:25:05.1408343Z 2025-05-07T20:25:05.1408347Z 2025-05-07T20:25:05.1408350Z 2025-05-07T20:25:05.1408354Z 2025-05-07T20:25:05.1408357Z 2025-05-07T20:25:05.1408361Z 2025-05-07T20:25:05.1408364Z 2025-05-07T20:25:05.1408588Z  2025-05-07T20:25:05.1408825Z 2025-05-07T20:25:05.1408905Z done 2025-05-07T20:25:05.2418642Z Preparing transaction: / done 2025-05-07T20:25:05.9788391Z Verifying transaction: \ | / - \ | / done 2025-05-07T20:25:07.5838338Z Executing transaction: \ | / - \ | / - \ | / - \ | / - done 2025-05-07T20:25:07.9363274Z [SETUP] Testing pyOpenSSL import ... 2025-05-07T20:25:09.6794300Z [CHECK] Python (sub-)package 'OpenSSL' found ... 2025-05-07T20:25:09.6807420Z [SETUP] Installing libxcrypt ... 2025-05-07T20:25:09.6831416Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt 2025-05-07T20:25:10.5463743Z Channels: 2025-05-07T20:25:10.5464053Z - conda-forge 2025-05-07T20:25:10.5464283Z Platform: linux-64 2025-05-07T20:25:13.7959981Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:25:14.1613126Z Solving environment: \ done 2025-05-07T20:25:14.1983222Z 2025-05-07T20:25:14.1983907Z # All requested packages already installed. 2025-05-07T20:25:14.1984179Z 2025-05-07T20:25:17.5667010Z [SETUP] Copying over ... 2025-05-07T20:25:17.5667724Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.12/crypt.h 2025-05-07T20:25:17.5668272Z 2025-05-07T20:25:17.5700309Z 2025-05-07T20:25:19.2025148Z [SETUP] Installed Python version: Python 3.12.2 2025-05-07T20:25:19.2025583Z [SETUP] Successfully created Conda environment: build_binary 2025-05-07T20:25:19.2062617Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc 2025-05-07T20:25:19.2063098Z . 
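The crypt.h copy above publishes conda-forge's libxcrypt header inside the Python include tree, so translation units that expect to find crypt.h next to the Python 3.12 headers keep compiling. A sketch of the same fix-up with the env prefix resolved at run time rather than hard-coded (a hypothetical variant; the log itself uses the absolute path):

    # Resolve the env prefix via the env's own interpreter, then mirror
    # the cp step from the log (include/python3.12 assumed per this log).
    PREFIX="$(conda run -n build_binary python -c 'import sys; print(sys.prefix)')"
    cp "${PREFIX}/include/crypt.h" "${PREFIX}/include/python3.12/crypt.h"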
2025-05-07T20:25:19.2062617Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:19.2063098Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:19.2077694Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:19.2078045Z env:
2025-05-07T20:25:19.2078273Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:19.2078570Z   BUILD_ENV: build_binary
2025-05-07T20:25:19.2078865Z   BUILD_TARGET: genai
2025-05-07T20:25:19.2079100Z   BUILD_VARIANT: cuda
2025-05-07T20:25:19.2079331Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:25:19.2079592Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:19.2079891Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:19.2080224Z ##[endgroup]
2025-05-07T20:25:19.5432000Z ################################################################################
2025-05-07T20:25:19.5432341Z # Install C/C++ Compilers
2025-05-07T20:25:19.5432579Z #
2025-05-07T20:25:19.5447680Z # [2025-05-07T20:25:19.544Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:25:19.5448468Z ################################################################################
2025-05-07T20:25:19.5462906Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:19.6360167Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:19.6371101Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:25:19.6393477Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:25:20.5019242Z Channels:
2025-05-07T20:25:20.5019556Z  - conda-forge
2025-05-07T20:25:20.5019897Z Platform: linux-64
2025-05-07T20:25:23.9216738Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:24.2932216Z Solving environment: done
2025-05-07T20:25:24.3566047Z ## Package Plan ##
2025-05-07T20:25:24.3566591Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:24.3567133Z   added / updated specs:
2025-05-07T20:25:24.3567499Z     - sysroot_linux-64=2.17
2025-05-07T20:25:24.3567944Z The following packages will be downloaded:
2025-05-07T20:25:24.3568456Z     package                          |            build
2025-05-07T20:25:24.3568956Z     ---------------------------------|-----------------
2025-05-07T20:25:24.3569618Z     kernel-headers_linux-64-3.10.0   |      he073ed8_18       921 KB  conda-forge
2025-05-07T20:25:24.3570352Z     sysroot_linux-64-2.17            |      h0157908_18      14.5 MB  conda-forge
2025-05-07T20:25:24.3570947Z     ------------------------------------------------------------
2025-05-07T20:25:24.3571377Z                                                   Total:      15.4 MB
2025-05-07T20:25:24.3571776Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:24.3572384Z   kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:25:24.3572952Z   sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
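The [EXEC] [ATTEMPT 0/3] prefixes indicate that each setup command runs under a retry wrapper defined in .github/scripts/setup_env.bash; the wrapper's body never appears in this log, so the following is a hypothetical bash sketch of the pattern the prefixes imply (function name, backoff, and attempt accounting are all assumptions):

    # Hypothetical retry wrapper matching the "[EXEC] [ATTEMPT n/3]" log lines.
    exec_with_retries () {
      local max=3 attempt=0
      while true; do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0            # command succeeded; stop retrying
        attempt=$((attempt + 1))
        [ "${attempt}" -gt "${max}" ] && return 1
        sleep 2                     # assumed brief backoff between attempts
      done
    }
    # Usage, mirroring the log:
    # exec_with_retries conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17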
2025-05-07T20:25:24.3573413Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:24.3573789Z [download progress condensed: kernel-headers_linux-64 (921 KB) and sysroot_linux-64 (14.5 MB) both reached 100%]
2025-05-07T20:25:25.3361616Z done
2025-05-07T20:25:25.4366050Z Preparing transaction: done
2025-05-07T20:25:25.6380932Z Verifying transaction: done
2025-05-07T20:25:25.8443564Z Executing transaction: done
2025-05-07T20:25:25.9984155Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:25:25.9984478Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:25:27.6702030Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
2025-05-07T20:25:27.6715264Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:25:27.6736863Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:25:28.5610504Z Channels:
2025-05-07T20:25:28.5610828Z  - conda-forge
2025-05-07T20:25:28.5611059Z Platform: linux-64
2025-05-07T20:25:31.8521300Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:32.8090004Z Solving environment: done
2025-05-07T20:25:32.8744438Z ## Package Plan ##
2025-05-07T20:25:32.8744981Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:32.8745526Z   added / updated specs:
2025-05-07T20:25:32.8745838Z     - gxx_linux-64=11.4.0
2025-05-07T20:25:32.8746133Z The following packages will be downloaded:
2025-05-07T20:25:32.8746511Z     package                          |            build
2025-05-07T20:25:32.8746829Z     ---------------------------------|-----------------
2025-05-07T20:25:32.8747235Z     binutils_impl_linux-64-2.40      |       ha1999f0_7       6.0 MB  conda-forge
2025-05-07T20:25:32.8747712Z     binutils_linux-64-2.40           |       hb3c18ed_4        28 KB  conda-forge
2025-05-07T20:25:32.8748167Z     gcc_impl_linux-64-11.4.0         |      h00c12a0_13      53.0 MB  conda-forge
2025-05-07T20:25:32.8748600Z     gcc_linux-64-11.4.0              |       ha077dfb_4        31 KB  conda-forge
2025-05-07T20:25:32.8749194Z     gxx_impl_linux-64-11.4.0         |      h634f3ee_13      11.2 MB  conda-forge
2025-05-07T20:25:32.8749856Z     gxx_linux-64-11.4.0              |       h35bfe5d_4        29 KB  conda-forge
2025-05-07T20:25:32.8750505Z     ld_impl_linux-64-2.40            |       hf3520f5_7       691 KB  conda-forge
2025-05-07T20:25:32.8751276Z     libgcc-devel_linux-64-11.4.0     |     h8f596e0_113       2.3 MB  conda-forge
2025-05-07T20:25:32.8752016Z     libsanitizer-11.4.0              |      h5763a12_13       3.5 MB  conda-forge
2025-05-07T20:25:32.8752693Z     libstdcxx-15.1.0                 |       h8f9b012_2       3.7 MB  conda-forge
2025-05-07T20:25:32.8753363Z     libstdcxx-devel_linux-64-11.4.0  |     h8f596e0_113      11.1 MB  conda-forge
2025-05-07T20:25:32.8753931Z     libstdcxx-ng-15.1.0              |       h4852527_2        34 KB  conda-forge
2025-05-07T20:25:32.8754408Z     ------------------------------------------------------------
2025-05-07T20:25:32.8754840Z                                                   Total:      91.6 MB
2025-05-07T20:25:32.8755245Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:32.8755861Z   binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:25:32.8756415Z   binutils_linux-64  conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:25:32.8757285Z   gcc_impl_linux-64  conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:25:32.8757798Z   gcc_linux-64       conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:25:32.8758383Z   gxx_impl_linux-64  conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:25:32.8758885Z   gxx_linux-64       conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:25:32.8759406Z   libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:32.8759990Z   libsanitizer       conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:25:32.8760718Z   libstdcxx          conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:25:32.8761594Z   libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:32.8762258Z The following packages will be UPDATED:
2025-05-07T20:25:32.8762920Z   ld_impl_linux-64   pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:25:32.8763849Z   libstdcxx-ng       pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:25:32.8764521Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:32.8764894Z [download progress condensed: gcc_impl_linux-64 (53.0 MB), gxx_impl_linux-64 (11.2 MB), libstdcxx-devel_linux-64 (11.1 MB), binutils_impl_linux-64 (6.0 MB), libstdcxx (3.7 MB), libsanitizer (3.5 MB), libgcc-devel_linux-64 (2.3 MB), ld_impl_linux-64 (691 KB), libstdcxx-ng, gcc_linux-64, gxx_linux-64, and binutils_linux-64 all reached 100%]
2025-05-07T20:25:35.0446908Z done
2025-05-07T20:25:35.1440909Z Preparing transaction: done
2025-05-07T20:25:35.4452756Z Verifying transaction: done
2025-05-07T20:25:35.5461572Z Executing transaction: done
2025-05-07T20:25:35.7114447Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:25:39.6053941Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:39.6086352Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:39.6115259Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:39.6145756Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:41.5031424Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:41.5644856Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:43.4477391Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:43.5088988Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:45.3903067Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:45.4516361Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:47.3330310Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:47.3967322Z [CHECK] Binary g++ found in PATH
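The four ln -sf calls above map the generic cc/gcc/c++/g++ names onto conda's prefixed cross-toolchain binaries, and each name is then resolved through PATH as a sanity check. A compact sketch of the same setup (the env prefix is the one shown in this log; the loops are merely a condensed form of the four explicit calls):

    # Point the generic compiler names at the conda toolchain binaries.
    BIN=/home/ec2-user/miniconda/envs/build_binary/bin
    for name in cc gcc; do
      ln -sf "${BIN}/x86_64-conda-linux-gnu-cc" "${BIN}/${name}"
    done
    for name in c++ g++; do
      ln -sf "${BIN}/x86_64-conda-linux-gnu-c++" "${BIN}/${name}"
    done
    # Sanity check: each name should resolve inside ${BIN}.
    which cc gcc c++ g++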
2025-05-07T20:25:47.3971350Z [INFO] Printing out all preprocessor defines in the C compiler ...
2025-05-07T20:25:47.3971955Z + conda run -n build_binary cc -dM -E -
2025-05-07T20:25:49.2890375Z [macro dump condensed: roughly two hundred predefined preprocessor macros follow; representative entries below]
2025-05-07T20:25:49.2928020Z #define __STDC_IEC_559__ 1
2025-05-07T20:25:49.2930832Z #define __gnu_linux__ 1
2025-05-07T20:25:49.2933222Z #define __GNUC__ 11
2025-05-07T20:25:49.2948822Z #define __GXX_ABI_VERSION 1016
2025-05-07T20:25:49.2954496Z #define __LP64__ 1
2025-05-07T20:25:49.2963296Z #define __VERSION__ "11.4.0"
2025-05-07T20:25:49.2980220Z #define __x86_64__ 1
2025-05-07T20:25:49.2993180Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:49.3006605Z #define __linux__ 1
2025-05-07T20:25:49.3028272Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__
2025-05-07T20:25:49.3030555Z [... remainder of the macro listing omitted from this excerpt; the log itself breaks off mid-entry at "#define __LONG_WIDTH__"]
64 2025-05-07T20:25:49.3030879Z #define __PIC__ 2 2025-05-07T20:25:49.3031230Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:49.3031807Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:49.3032364Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:49.3032838Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:49.3033335Z #define __SSE2__ 1 2025-05-07T20:25:49.3033662Z #define __INT32_TYPE__ int 2025-05-07T20:25:49.3033998Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:49.3034364Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:49.3034839Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:49.3035336Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:49.3035850Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:49.3036234Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:49.3036626Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.3048011Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:49.3048434Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:49.3048781Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:49.3049212Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.3049652Z #define __PIE__ 2 2025-05-07T20:25:49.3050108Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:49.3050691Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:49.3051197Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:49.3051720Z #define __INT16_C(c) c 2025-05-07T20:25:49.3052031Z #define __STDC__ 1 2025-05-07T20:25:49.3052360Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:49.3052743Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:49.3053102Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:49.3053532Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:49.3054032Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:49.3054528Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:49.3054903Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:49.3055305Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:49.3055689Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:49.3056095Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:49.3056513Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.3056918Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:49.3057327Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.3057898Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:49.3058437Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:49.3058860Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:49.3059286Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:49.3059643Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:49.3059865Z 2025-05-07T20:25:49.3563859Z 2025-05-07T20:25:49.3564622Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
2025-05-07T20:25:49.3565125Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:49.3566257Z 2025-05-07T20:25:51.2440452Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:51.2440810Z #define __cpp_attributes 200809L 2025-05-07T20:25:51.2441161Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:51.2441519Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:51.2441809Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:51.2442071Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:51.2442415Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:51.2442775Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:51.2443056Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:51.2443379Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:51.2443695Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:51.2443961Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:51.2444215Z #define __CHAR_BIT__ 8 2025-05-07T20:25:51.2444457Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:51.2445056Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:51.2445319Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:51.2445592Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:51.2445872Z #define __cpp_static_assert 201411L 2025-05-07T20:25:51.2446160Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:51.2446463Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.2446771Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:51.2447058Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:51.2447389Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:51.2447721Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:51.2448120Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:51.2448538Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:51.2448856Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:51.2449138Z #define __GCC_IEC_559 2 2025-05-07T20:25:51.2449395Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:51.2449846Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:51.2450149Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:51.2450468Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:51.2450769Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:51.2451095Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:51.2451404Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:51.2451738Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.2452073Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:51.2452347Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:51.2452632Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:51.2452920Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:51.2453217Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:51.2453492Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:51.2453829Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:51.2454114Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:51.2454450Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:51.2454789Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:51.2455054Z #define __INT8_C(c) c 2025-05-07T20:25:51.2455291Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:51.2455567Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:51.2455895Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.2456219Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:51.2456503Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:51.2456802Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:51.2457115Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:51.2457473Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:51.2457765Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:51.2458046Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:51.2458317Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.2458602Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:51.2458883Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:51.2459286Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:51.2459701Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:51.2459996Z #define __linux 1 2025-05-07T20:25:51.2460248Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:51.2460555Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:51.2460838Z #define __unix 1 2025-05-07T20:25:51.2461062Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:51.2461350Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:51.2461645Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:51.2461916Z #define __WINT_MIN__ 0U 2025-05-07T20:25:51.2462172Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:51.2462462Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:51.2462749Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:51.2463020Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:51.2463279Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:51.2463570Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:51.2463971Z #define __INT64_C(c) c ## L 2025-05-07T20:25:51.2464247Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:51.2464553Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:51.2464827Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:51.2465137Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:51.2465783Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:51.2466087Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:51.2466446Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:51.2466830Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:51.2467084Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:51.2467371Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:51.2467657Z #define __DBL_DIG__ 15 2025-05-07T20:25:51.2467894Z #define __FLT32_DIG__ 6 2025-05-07T20:25:51.2468194Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:51.2468550Z #define __GXX_WEAK__ 1 2025-05-07T20:25:51.2468795Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:51.2469219Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:51.2469583Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:51.2469978Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:51.2470262Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:51.2470643Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:51.2471011Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:51.2471469Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:51.2471928Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:51.2472231Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:51.2472516Z #define __unix__ 1 2025-05-07T20:25:51.2472749Z #define __INT_WIDTH__ 32 2025-05-07T20:25:51.2473018Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:51.2473287Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:51.2473574Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:51.2473862Z #define __UINT16_C(c) c 2025-05-07T20:25:51.2474134Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:51.2474415Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:51.2474811Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:51.2475227Z #define __gnu_linux__ 1 2025-05-07T20:25:51.2475487Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:51.2475844Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:51.2476135Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:51.2476427Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.2476704Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:51.2476965Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:51.2477225Z #define __GNUC__ 11 2025-05-07T20:25:51.2477448Z #define __GXX_RTTI 1 2025-05-07T20:25:51.2477674Z #define __pie__ 2 2025-05-07T20:25:51.2477894Z #define __MMX__ 1 2025-05-07T20:25:51.2478121Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:51.2478383Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:51.2478668Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:51.2478946Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:51.2479194Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:51.2479495Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:51.2479820Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:51.2480184Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:51.2480591Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:51.2480899Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.2481218Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:51.2481479Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:51.2481747Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:51.2482058Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:51.2482368Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:51.2482630Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:51.2482895Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:51.2483189Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:51.2483485Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:51.2483937Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:51.2484230Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:51.2484483Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:51.2484755Z #define __cplusplus 201703L 2025-05-07T20:25:51.2485027Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:51.2485312Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:51.2485573Z #define __DEPRECATED 1 2025-05-07T20:25:51.2485830Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:51.2486131Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:51.2486385Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:51.2486709Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:51.2487071Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:51.2487338Z #define __SSE2_MATH__ 1 2025-05-07T20:25:51.2487591Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:51.2487897Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.2488186Z #define __amd64 1 2025-05-07T20:25:51.2488518Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:51.2488789Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:51.2489054Z #define __GNUG__ 11 2025-05-07T20:25:51.2489314Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:51.2489629Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:51.2489882Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:51.2490177Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:51.2490474Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:51.2490733Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:51.2491004Z #define __cpp_initializer_lists 200806L 2025-05-07T20:25:51.2491302Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:51.2491571Z #define __cpp_hex_float 201603L 2025-05-07T20:25:51.2491837Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:51.2492106Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:51.2492383Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:51.2492645Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:51.2492917Z #define __x86_64 1 2025-05-07T20:25:51.2493162Z #define __cpp_lambdas 200907L 2025-05-07T20:25:51.2493429Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:51.2493801Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:51.2494194Z #define __cpp_template_auto 201606L 2025-05-07T20:25:51.2494546Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:51.2495002Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:51.2495471Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:51.2495861Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:51.2496112Z #define __LP64__ 1 2025-05-07T20:25:51.2496346Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.2496702Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:51.2497078Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:51.2497360Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:51.2497650Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:51.2497930Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:51.2498207Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:51.2498472Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:51.2498734Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:51.2499066Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:51.2499431Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:51.2499718Z #define __FLT_DIG__ 6 2025-05-07T20:25:51.2499948Z #define __NO_INLINE__ 1 2025-05-07T20:25:51.2500195Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:25:51.2500524Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:51.2500869Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:51.2501126Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:51.2501391Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:51.2501648Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:51.2501927Z #define __cpp_unicode_characters 201411L 2025-05-07T20:25:51.2502381Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:51.2502647Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:51.2502940Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:51.2503226Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:51.2503497Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:51.2503793Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:51.2504133Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:25:51.2504426Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:51.2504688Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:51.2504947Z #define __FLT128_DIG__ 33 2025-05-07T20:25:51.2505189Z #define __INT32_C(c) c 2025-05-07T20:25:51.2505426Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:51.2505710Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:51.2505989Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:51.2506266Z #define 
__INT_FAST32_TYPE__ long int 2025-05-07T20:25:51.2506582Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:51.2506890Z #define unix 1 2025-05-07T20:25:51.2507243Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:51.2507506Z #define __cpp_rtti 199711L 2025-05-07T20:25:51.2507774Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:51.2508085Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.2508391Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:51.2508705Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:51.2509036Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:51.2509288Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:51.2509587Z #define __cpp_digit_separators 201309L 2025-05-07T20:25:51.2509869Z #define __ELF__ 1 2025-05-07T20:25:51.2510121Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:51.2510436Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:51.2510719Z #define __FLT_RADIX__ 2 2025-05-07T20:25:51.2510967Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:51.2511327Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:51.2511699Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:51.2511974Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:25:51.2512254Z #define __k8 1 2025-05-07T20:25:51.2512552Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:51.2512928Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:51.2513220Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:51.2513522Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:51.2513786Z #define __LDBL_DIG__ 18 2025-05-07T20:25:51.2514028Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:51.2514292Z #define __x86_64__ 1 2025-05-07T20:25:51.2514538Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:51.2514835Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:51.2515174Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.2515487Z #define __FLT64_DIG__ 15 2025-05-07T20:25:51.2515917Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.2516267Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:51.2516599Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.2516864Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:51.2517143Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.2517444Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:51.2517814Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:25:51.2518205Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:51.2518501Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:51.2518829Z #define __cpp_unicode_literals 200710L 2025-05-07T20:25:51.2519141Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:51.2519466Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:51.2519768Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:51.2520044Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:51.2520361Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:51.2520683Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:51.2520920Z #define __SEG_FS 1 2025-05-07T20:25:51.2521259Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:51.2521541Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:51.2521820Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.2522103Z #define __SEG_GS 1 2025-05-07T20:25:51.2522417Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 
2025-05-07T20:25:51.2522806Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:51.2523076Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:51.2523364Z #define __INT16_TYPE__ short int 2025-05-07T20:25:51.2523646Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:51.2523954Z #define __cpp_structured_bindings 201606L 2025-05-07T20:25:51.2524257Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:51.2524511Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:51.2524769Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:51.2525115Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:51.2525508Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.2525926Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:25:51.2526248Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:25:51.2526550Z #define linux 1 2025-05-07T20:25:51.2526786Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.2527065Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:51.2527340Z #define __EXCEPTIONS 1 2025-05-07T20:25:51.2527589Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:51.2527849Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:51.2528120Z #define __cpp_range_based_for 201603L 2025-05-07T20:25:51.2528416Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:51.2528758Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:51.2529149Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:25:51.2529500Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:51.2529831Z #define __code_model_small__ 1 2025-05-07T20:25:51.2530102Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:51.2530420Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:25:51.2530735Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:51.2531008Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:25:51.2531301Z #define __k8__ 1 2025-05-07T20:25:51.2531533Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:51.2531814Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:51.2532112Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:51.2532358Z #define __pic__ 2 2025-05-07T20:25:51.2532602Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.2532919Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:51.2533190Z #define __cpp_decltype 200707L 2025-05-07T20:25:51.2533477Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.2533808Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:51.2534179Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:51.2534539Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:51.2534831Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:51.2535169Z #define __cpp_inline_variables 201606L 2025-05-07T20:25:51.2535461Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:51.2535711Z #define __linux__ 1 2025-05-07T20:25:51.2535940Z #define __INT64_TYPE__ long int 2025-05-07T20:25:51.2536204Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:51.2536461Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:51.2536736Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:25:51.2537020Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:25:51.2537331Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:51.2537627Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.2537947Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:51.2538212Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:51.2538509Z #define __UINT_LEAST32_TYPE__ unsigned 
int 2025-05-07T20:25:51.2538811Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:51.2539159Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:51.2539512Z #define __SSE__ 1 2025-05-07T20:25:51.2539845Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:51.2540216Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:51.2540581Z #define __amd64__ 1 2025-05-07T20:25:51.2540809Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:51.2541067Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:51.2541336Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:51.2541605Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:51.2541881Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:51.2542139Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:51.2542417Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:51.2551906Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:51.2552293Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:51.2552765Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:51.2553127Z #define _LP64 1 2025-05-07T20:25:51.2553344Z #define __UINT8_C(c) c 2025-05-07T20:25:51.2553606Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:51.2554031Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:51.2554302Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:51.2554571Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:51.2554934Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:51.2555402Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:51.2555942Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.2556242Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.2556556Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:51.2556861Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:25:51.2557249Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:51.2557620Z #define __STDCPP_THREADS__ 1 2025-05-07T20:25:51.2557886Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:51.2558154Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:51.2558504Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:51.2558883Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:51.2559141Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:51.2559398Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:51.2559653Z #define __FXSR__ 1 2025-05-07T20:25:51.2559954Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:51.2560410Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:51.2560822Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:51.2561132Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:51.2561404Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:25:51.2561707Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:51.2561999Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:51.2562271Z #define __cpp_alias_templates 200704L 2025-05-07T20:25:51.2562632Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:51.2562999Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:51.2563275Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:51.2563527Z #define __LONG_WIDTH__ 64 2025-05-07T20:25:51.2563766Z #define __PIC__ 2 2025-05-07T20:25:51.2564013Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:51.2564418Z #define 
__FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:51.2564813Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:51.2565145Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:51.2566490Z #define __cpp_constexpr 201603L 2025-05-07T20:25:51.2566761Z #define __SSE2__ 1 2025-05-07T20:25:51.2566995Z #define __cpp_deduction_guides 201703L 2025-05-07T20:25:51.2567290Z #define __INT32_TYPE__ int 2025-05-07T20:25:51.2567546Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:51.2567808Z #define __cpp_exceptions 199711L 2025-05-07T20:25:51.2568090Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:51.2568432Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:51.2569092Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:51.2569365Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:51.2569643Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:51.2569920Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.2570220Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:51.2570498Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:51.2570758Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:25:51.2571051Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:51.2571343Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.2571648Z #define __PIE__ 2 2025-05-07T20:25:51.2571967Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:51.2572384Z #define __cpp_template_template_args 201611L 2025-05-07T20:25:51.2572698Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:51.2573047Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:51.2573412Z #define __INT16_C(c) c 2025-05-07T20:25:51.2573645Z #define __STDC__ 1 2025-05-07T20:25:51.2574032Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:51.2574285Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:51.2574564Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:51.2574828Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:51.2575124Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:51.2575474Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:51.2575813Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:51.2576076Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:51.2576370Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:25:51.2576655Z #define __SSE_MATH__ 1 2025-05-07T20:25:51.2576895Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:51.2577182Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:25:51.2577499Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:51.2577788Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:51.2578076Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.2578355Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:51.2578667Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.2579058Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:51.2579434Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:51.2579747Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:51.2580036Z #define _GNU_SOURCE 1 2025-05-07T20:25:51.2580290Z #define __cpp_init_captures 201304L 2025-05-07T20:25:51.2580575Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:51.2580827Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:51.2580993Z 2025-05-07T20:25:51.3066755Z 2025-05-07T20:25:51.3067100Z + conda run -n build_binary c++ --version 2025-05-07T20:25:51.3067352Z 2025-05-07T20:25:53.1870728Z c++ 
(conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:53.1871111Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:53.1871607Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:25:53.1872166Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:53.1872504Z
2025-05-07T20:25:53.1872509Z
2025-05-07T20:25:53.2488636Z
2025-05-07T20:25:53.2489121Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:53.2489660Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:53.2489968Z
2025-05-07T20:25:55.2110607Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:55.2112521Z
2025-05-07T20:25:55.2113126Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:55.2113699Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:55.2114008Z
2025-05-07T20:25:57.1632179Z #define __cplusplus 201703L
2025-05-07T20:25:57.1634286Z
2025-05-07T20:25:57.1635704Z [INSTALL] Successfully installed C/C++ compilers
2025-05-07T20:25:57.1670270Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:57.1670701Z . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:57.1682349Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:57.1682702Z env:
2025-05-07T20:25:57.1682939Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:57.1683255Z BUILD_ENV: build_binary
2025-05-07T20:25:57.1683506Z BUILD_TARGET: genai
2025-05-07T20:25:57.1683743Z BUILD_VARIANT: cuda
2025-05-07T20:25:57.1683984Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:25:57.1684243Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:57.1684552Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:57.1684896Z ##[endgroup]
2025-05-07T20:25:57.5009842Z ################################################################################
2025-05-07T20:25:57.5010215Z # Install CUDA
2025-05-07T20:25:57.5010423Z #
2025-05-07T20:25:57.5024266Z # [2025-05-07T20:25:57.502Z] + install_cuda build_binary 12.8.0
2025-05-07T20:25:57.5024651Z ################################################################################
2025-05-07T20:25:57.5025190Z
2025-05-07T20:25:57.5039638Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:57.5950996Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:57.5951476Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:57.5956663Z + conda clean --packages --tarball -y
2025-05-07T20:25:57.5956891Z
2025-05-07T20:25:58.4647438Z Will remove 40 (182.7 MB) tarball(s).
2025-05-07T20:25:58.4648085Z Will remove 7 (108.6 MB) package(s).
2025-05-07T20:25:58.5268953Z
2025-05-07T20:25:58.5278262Z + conda clean --all -y
2025-05-07T20:25:58.5278420Z
2025-05-07T20:25:59.1968708Z There are no unused tarball(s) to remove.
2025-05-07T20:25:59.1969042Z Will remove 1 index cache(s).
2025-05-07T20:25:59.1969326Z There are no unused package(s) to remove.
2025-05-07T20:25:59.1969639Z There are no tempfile(s) to remove.
2025-05-07T20:25:59.1969950Z There are no logfile(s) to remove.
2025-05-07T20:25:59.2588766Z
2025-05-07T20:25:59.2604054Z [INSTALL] Installing CUDA 12.8.0 ...
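The compiler probes above and the install step that follows reduce to a handful of reproducible shell commands. Below is a minimal sketch for running the same checks by hand; it assumes a Conda environment named build_binary like the one this job uses, and the expected values in the comments (C17 / C++17 for this conda-forge GCC 11.4.0) are read off the log output above rather than enforced by the script.

    # Sketch: reproduce the toolchain sanity checks from this job by hand.
    # Assumes a Conda env named build_binary with conda-forge gcc/gxx installed.
    env_name=build_binary

    # Report the compiler version.
    conda run -n "$env_name" c++ --version

    # Dump every predefined preprocessor macro (C, then C++); an empty stdin
    # is enough because -E only needs something to preprocess.
    conda run -n "$env_name" cc -dM -E - < /dev/null
    conda run -n "$env_name" c++ -dM -E -x c++ - < /dev/null

    # Default language standards: 201710L is C17, 201703L is C++17.
    conda run -n "$env_name" cc -dM -E - < /dev/null | grep __STDC_VERSION__
    conda run -n "$env_name" c++ -dM -E -x c++ - < /dev/null | grep __cplusplus

    # The CUDA step that follows is, at its core, a single pinned install
    # from conda-forge (the job's retry wrapper is elided here):
    conda install --force-reinstall -n "$env_name" -c conda-forge --override-channels -y cuda=12.8.0

As the [EXEC] [ATTEMPT 0/3] markers in the log suggest, the actual job wraps the install in a retry helper (apparently up to three attempts) and cleans the Conda package caches first, so a transient network or channel hiccup need not fail the build outright.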
2025-05-07T20:25:59.2627934Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.8.0 2025-05-07T20:26:00.1707804Z Channels: 2025-05-07T20:26:00.1708054Z - conda-forge 2025-05-07T20:26:00.1708286Z Platform: linux-64 2025-05-07T20:26:10.6958248Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ done 2025-05-07T20:26:11.8027190Z Solving environment: / - \ | / done 2025-05-07T20:26:11.8787952Z 2025-05-07T20:26:11.8788313Z ## Package Plan ## 2025-05-07T20:26:11.8788539Z 2025-05-07T20:26:11.8788819Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:26:11.8789228Z 2025-05-07T20:26:11.8789371Z added / updated specs: 2025-05-07T20:26:11.8789635Z - cuda=12.8.0 2025-05-07T20:26:11.8789767Z 2025-05-07T20:26:11.8789797Z 2025-05-07T20:26:11.8789927Z The following packages will be downloaded: 2025-05-07T20:26:11.8790140Z 2025-05-07T20:26:11.8790321Z package | build 2025-05-07T20:26:11.8790763Z ---------------------------|----------------- 2025-05-07T20:26:11.8791289Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T20:26:11.8791904Z attr-2.5.1 | h166bdaf_1 69 KB conda-forge 2025-05-07T20:26:11.8792328Z binutils-2.40 | h4852527_7 31 KB conda-forge 2025-05-07T20:26:11.8792772Z c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge 2025-05-07T20:26:11.8793349Z cuda-12.8.0 | ha804496_0 26 KB conda-forge 2025-05-07T20:26:11.8793839Z cuda-cccl_linux-64-12.8.55 | ha770c72_1 1.0 MB conda-forge 2025-05-07T20:26:11.8795701Z cuda-command-line-tools-12.8.0| ha770c72_0 20 KB conda-forge 2025-05-07T20:26:11.8796240Z cuda-compiler-12.8.0 | hbad6d8a_0 20 KB conda-forge 2025-05-07T20:26:11.8796719Z cuda-crt-dev_linux-64-12.8.61| ha770c72_1 90 KB conda-forge 2025-05-07T20:26:11.8797256Z cuda-crt-tools-12.8.61 | ha770c72_1 27 KB conda-forge 2025-05-07T20:26:11.8797701Z cuda-cudart-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:26:11.8798157Z cuda-cudart-dev-12.8.57 | h5888daf_1 23 KB conda-forge 2025-05-07T20:26:11.8798648Z cuda-cudart-dev_linux-64-12.8.57| h3f2d84a_1 377 KB conda-forge 2025-05-07T20:26:11.8799146Z cuda-cudart-static-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:26:11.8799661Z cuda-cudart-static_linux-64-12.8.57| h3f2d84a_1 950 KB conda-forge 2025-05-07T20:26:11.8800173Z cuda-cudart_linux-64-12.8.57| h3f2d84a_1 188 KB conda-forge 2025-05-07T20:26:11.8800660Z cuda-cuobjdump-12.8.55 | hbd13f7d_0 227 KB conda-forge 2025-05-07T20:26:11.8801107Z cuda-cupti-12.8.57 | hbd13f7d_0 1.8 MB conda-forge 2025-05-07T20:26:11.8801721Z cuda-cupti-dev-12.8.57 | h5888daf_0 4.0 MB conda-forge 2025-05-07T20:26:11.8802181Z cuda-cuxxfilt-12.8.55 | hbd13f7d_0 211 KB conda-forge 2025-05-07T20:26:11.8802635Z cuda-driver-dev-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:26:11.8803123Z cuda-driver-dev_linux-64-12.8.90| h3f2d84a_1 36 KB conda-forge 2025-05-07T20:26:11.8803585Z cuda-gdb-12.8.55 | h50b4baa_0 353 KB conda-forge 2025-05-07T20:26:11.8804024Z cuda-libraries-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:11.8804488Z cuda-libraries-dev-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:11.8804966Z cuda-nsight-12.8.55 | h7938cbb_0 113.2 MB conda-forge 2025-05-07T20:26:11.8805406Z cuda-nvcc-12.8.61 | hcdd1206_0 23 KB conda-forge 2025-05-07T20:26:11.8805880Z cuda-nvcc-dev_linux-64-12.8.61| he91c749_1 12.7 MB conda-forge 2025-05-07T20:26:11.8806350Z cuda-nvcc-impl-12.8.61 | h85509e4_1 25 KB conda-forge 2025-05-07T20:26:11.8806812Z cuda-nvcc-tools-12.8.61 | he02047a_1 24.5 
MB conda-forge 2025-05-07T20:26:11.8807280Z cuda-nvcc_linux-64-12.8.61 | h04802cd_0 25 KB conda-forge 2025-05-07T20:26:11.8807734Z cuda-nvdisasm-12.8.55 | hbd13f7d_0 4.9 MB conda-forge 2025-05-07T20:26:11.8808198Z cuda-nvml-dev-12.8.55 | hbd13f7d_0 134 KB conda-forge 2025-05-07T20:26:11.8808646Z cuda-nvprof-12.8.57 | hbd13f7d_0 2.5 MB conda-forge 2025-05-07T20:26:11.8809099Z cuda-nvprune-12.8.55 | hbd13f7d_0 68 KB conda-forge 2025-05-07T20:26:11.8809545Z cuda-nvrtc-12.8.61 | hbd13f7d_0 63.1 MB conda-forge 2025-05-07T20:26:11.8809994Z cuda-nvrtc-dev-12.8.61 | h5888daf_0 34 KB conda-forge 2025-05-07T20:26:11.8810443Z cuda-nvtx-12.8.55 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:26:11.8810897Z cuda-nvvm-dev_linux-64-12.8.61| ha770c72_1 25 KB conda-forge 2025-05-07T20:26:11.8811391Z cuda-nvvm-impl-12.8.61 | he02047a_1 20.8 MB conda-forge 2025-05-07T20:26:11.8811891Z cuda-nvvm-tools-12.8.61 | he02047a_1 23.5 MB conda-forge 2025-05-07T20:26:11.8812337Z cuda-nvvp-12.8.57 | hbd13f7d_0 112.4 MB conda-forge 2025-05-07T20:26:11.8812766Z cuda-opencl-12.8.55 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:26:11.8813222Z cuda-opencl-dev-12.8.55 | h5888daf_0 95 KB conda-forge 2025-05-07T20:26:11.8813839Z cuda-profiler-api-12.8.55 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:26:11.8814305Z cuda-runtime-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:26:11.8814778Z cuda-sanitizer-api-12.8.55 | hbd13f7d_0 8.8 MB conda-forge 2025-05-07T20:26:11.8815259Z cuda-toolkit-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:26:11.8815702Z cuda-tools-12.8.0 | ha770c72_0 19 KB conda-forge 2025-05-07T20:26:11.8816132Z cuda-version-12.8 | h5d125a7_3 21 KB conda-forge 2025-05-07T20:26:11.8816594Z cuda-visual-tools-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:11.8817066Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:26:11.8817488Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:26:11.8817948Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:26:11.8818474Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:26:11.8818994Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:26:11.8819564Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:26:11.8820010Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:26:11.8820473Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:26:11.8820950Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:26:11.8821386Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:26:11.8821836Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:11.8822244Z gds-tools-1.13.0.11 | h5888daf_0 37.9 MB conda-forge 2025-05-07T20:26:11.8822641Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:26:11.8823027Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:11.8823427Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:26:11.8823832Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:26:11.8824219Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:26:11.8824640Z libcublas-12.8.3.14 | h9ab20c4_0 460.2 MB conda-forge 2025-05-07T20:26:11.8825096Z libcublas-dev-12.8.3.14 | h9ab20c4_0 89 KB conda-forge 2025-05-07T20:26:11.8825539Z libcufft-11.3.3.41 | hbd13f7d_0 147.4 MB conda-forge 2025-05-07T20:26:11.8825980Z libcufft-dev-11.3.3.41 | h5888daf_0 33 KB conda-forge 2025-05-07T20:26:11.8826423Z libcufile-1.13.0.11 | h12f29b5_0 939 KB conda-forge 2025-05-07T20:26:11.8826876Z libcufile-dev-1.13.0.11 | h5888daf_0 
35 KB conda-forge 2025-05-07T20:26:11.8827319Z libcurand-10.3.9.55 | hbd13f7d_0 43.6 MB conda-forge 2025-05-07T20:26:11.8827777Z libcurand-dev-10.3.9.55 | h5888daf_0 265 KB conda-forge 2025-05-07T20:26:11.8828232Z libcusolver-11.7.2.55 | h9ab20c4_0 156.9 MB conda-forge 2025-05-07T20:26:11.8828697Z libcusolver-dev-11.7.2.55 | h9ab20c4_0 59 KB conda-forge 2025-05-07T20:26:11.8829165Z libcusparse-12.5.7.53 | hbd13f7d_0 164.9 MB conda-forge 2025-05-07T20:26:11.8829633Z libcusparse-dev-12.5.7.53 | h5888daf_0 51 KB conda-forge 2025-05-07T20:26:11.8830103Z libedit-3.1.20191231 | he28a2e2_2 121 KB conda-forge 2025-05-07T20:26:11.8830548Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:26:11.8830995Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:26:11.8831554Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:26:11.8831995Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:26:11.8832411Z libglvnd-1.7.0 | ha4b6fd6_2 129 KB conda-forge 2025-05-07T20:26:11.8832845Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:26:11.8833275Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:26:11.8833676Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:26:11.8834088Z libnpp-12.3.3.65 | hbd13f7d_0 130.6 MB conda-forge 2025-05-07T20:26:11.8834521Z libnpp-dev-12.3.3.65 | h5888daf_0 443 KB conda-forge 2025-05-07T20:26:11.8834952Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:26:11.8835383Z libnvfatbin-12.8.55 | hbd13f7d_0 793 KB conda-forge 2025-05-07T20:26:11.8835957Z libnvfatbin-dev-12.8.55 | h5888daf_0 26 KB conda-forge 2025-05-07T20:26:11.8836514Z libnvjitlink-12.8.61 | hbd13f7d_0 28.7 MB conda-forge 2025-05-07T20:26:11.8836980Z libnvjitlink-dev-12.8.61 | h5888daf_0 25 KB conda-forge 2025-05-07T20:26:11.8837442Z libnvjpeg-12.3.5.57 | h97fd463_0 3.0 MB conda-forge 2025-05-07T20:26:11.8837891Z libnvjpeg-dev-12.3.5.57 | ha770c72_0 31 KB conda-forge 2025-05-07T20:26:11.8838338Z libopengl-1.7.0 | ha4b6fd6_2 50 KB conda-forge 2025-05-07T20:26:11.8838750Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:26:11.8839168Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:26:11.8839606Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:26:11.8840049Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:26:11.8840456Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:26:11.8840890Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:26:11.8841333Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:26:11.8841747Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:26:11.8842156Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:26:11.8842573Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:26:11.8843016Z nsight-compute-2025.1.0.14 | hb5ebaad_0 320.6 MB conda-forge 2025-05-07T20:26:11.8843459Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:26:11.8843847Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:26:11.8844249Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:26:11.8844701Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:26:11.8845139Z pcre2-10.44 | hc749103_2 934 KB conda-forge 2025-05-07T20:26:11.8845573Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T20:26:11.8846012Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:26:11.8846426Z sqlite-3.32.3 | hcee41ef_1 1.4 MB conda-forge 2025-05-07T20:26:11.8846823Z tk-8.6.13 
|noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:26:11.8847229Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:26:11.8847646Z xcb-util-0.4.1 | hb711507_2 19 KB conda-forge 2025-05-07T20:26:11.8856854Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:26:11.8857429Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:26:11.8857969Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:26:11.8858509Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:26:11.8858982Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:26:11.8859439Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:26:11.8859900Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:26:11.8860330Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:26:11.8860764Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:26:11.8861204Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:26:11.8861674Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:26:11.8862165Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:26:11.8862757Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:11.8863216Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:26:11.8863665Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:11.8864110Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:26:11.8864558Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:26:11.8865022Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:26:11.8865855Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:26:11.8866362Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:26:11.8866913Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:26:11.8867299Z ------------------------------------------------------------ 2025-05-07T20:26:11.8867650Z Total: 1.88 GB 2025-05-07T20:26:11.8867861Z 2025-05-07T20:26:11.8868001Z The following NEW packages will be INSTALLED: 2025-05-07T20:26:11.8868224Z 2025-05-07T20:26:11.8868435Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:26:11.8868859Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:26:11.8869282Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:26:11.8869744Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:26:11.8870176Z cuda conda-forge/noarch::cuda-12.8.0-ha804496_0 2025-05-07T20:26:11.8870657Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.8.55-ha770c72_1 2025-05-07T20:26:11.8871254Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.8.0-ha770c72_0 2025-05-07T20:26:11.8871998Z cuda-compiler conda-forge/noarch::cuda-compiler-12.8.0-hbad6d8a_0 2025-05-07T20:26:11.8872539Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.8.61-ha770c72_1 2025-05-07T20:26:11.8873100Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.8.61-ha770c72_1 2025-05-07T20:26:11.8873621Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.8.57-h5888daf_1 2025-05-07T20:26:11.8874146Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.8.57-h5888daf_1 2025-05-07T20:26:11.8874719Z cuda-cudart-dev_l~ conda-forge/noarch::cuda-cudart-dev_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:26:11.8875324Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.8.57-h5888daf_1 2025-05-07T20:26:11.8876249Z cuda-cudart-stati~ 
conda-forge/noarch::cuda-cudart-static_linux-64-12.8.57-h3f2d84a_1
2025-05-07T20:26:11.8876864Z   cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.8.57-h3f2d84a_1
2025-05-07T20:26:11.8877424Z   cuda-cuobjdump     conda-forge/linux-64::cuda-cuobjdump-12.8.55-hbd13f7d_0
2025-05-07T20:26:11.8877944Z   cuda-cupti         conda-forge/linux-64::cuda-cupti-12.8.57-hbd13f7d_0
2025-05-07T20:26:11.8878455Z   cuda-cupti-dev     conda-forge/linux-64::cuda-cupti-dev-12.8.57-h5888daf_0
2025-05-07T20:26:11.8879039Z   cuda-cuxxfilt      conda-forge/linux-64::cuda-cuxxfilt-12.8.55-hbd13f7d_0
2025-05-07T20:26:11.8879795Z   cuda-driver-dev    conda-forge/linux-64::cuda-driver-dev-12.8.57-h5888daf_1
2025-05-07T20:26:11.8880497Z   cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.8.90-h3f2d84a_1
2025-05-07T20:26:11.8881036Z   cuda-gdb           conda-forge/linux-64::cuda-gdb-12.8.55-h50b4baa_0
2025-05-07T20:26:11.8881540Z   cuda-libraries     conda-forge/linux-64::cuda-libraries-12.8.0-ha770c72_0
2025-05-07T20:26:11.8882115Z   cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.8.0-ha770c72_0
2025-05-07T20:26:11.8882672Z   cuda-nsight        conda-forge/linux-64::cuda-nsight-12.8.55-h7938cbb_0
2025-05-07T20:26:11.8883340Z   cuda-nvcc          conda-forge/linux-64::cuda-nvcc-12.8.61-hcdd1206_0
2025-05-07T20:26:11.8883983Z   cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.8.61-he91c749_1
2025-05-07T20:26:11.8884609Z   cuda-nvcc-impl     conda-forge/linux-64::cuda-nvcc-impl-12.8.61-h85509e4_1
2025-05-07T20:26:11.8885158Z   cuda-nvcc-tools    conda-forge/linux-64::cuda-nvcc-tools-12.8.61-he02047a_1
2025-05-07T20:26:11.8885717Z   cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.8.61-h04802cd_0
2025-05-07T20:26:11.8886262Z   cuda-nvdisasm      conda-forge/linux-64::cuda-nvdisasm-12.8.55-hbd13f7d_0
2025-05-07T20:26:11.8886780Z   cuda-nvml-dev      conda-forge/linux-64::cuda-nvml-dev-12.8.55-hbd13f7d_0
2025-05-07T20:26:11.8887297Z   cuda-nvprof        conda-forge/linux-64::cuda-nvprof-12.8.57-hbd13f7d_0
2025-05-07T20:26:11.8887805Z   cuda-nvprune       conda-forge/linux-64::cuda-nvprune-12.8.55-hbd13f7d_0
2025-05-07T20:26:11.8888299Z   cuda-nvrtc         conda-forge/linux-64::cuda-nvrtc-12.8.61-hbd13f7d_0
2025-05-07T20:26:11.8888823Z   cuda-nvrtc-dev     conda-forge/linux-64::cuda-nvrtc-dev-12.8.61-h5888daf_0
2025-05-07T20:26:11.8889324Z   cuda-nvtx          conda-forge/linux-64::cuda-nvtx-12.8.55-hbd13f7d_0
2025-05-07T20:26:11.8889846Z   cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.8.61-ha770c72_1
2025-05-07T20:26:11.8890406Z   cuda-nvvm-impl     conda-forge/linux-64::cuda-nvvm-impl-12.8.61-he02047a_1
2025-05-07T20:26:11.8890956Z   cuda-nvvm-tools    conda-forge/linux-64::cuda-nvvm-tools-12.8.61-he02047a_1
2025-05-07T20:26:11.8891475Z   cuda-nvvp          conda-forge/linux-64::cuda-nvvp-12.8.57-hbd13f7d_0
2025-05-07T20:26:11.8892061Z   cuda-opencl        conda-forge/linux-64::cuda-opencl-12.8.55-hbd13f7d_0
2025-05-07T20:26:11.8892654Z   cuda-opencl-dev    conda-forge/linux-64::cuda-opencl-dev-12.8.55-h5888daf_0
2025-05-07T20:26:11.8893233Z   cuda-profiler-api  conda-forge/linux-64::cuda-profiler-api-12.8.55-h7938cbb_0
2025-05-07T20:26:11.8893786Z   cuda-runtime       conda-forge/noarch::cuda-runtime-12.8.0-ha804496_0
2025-05-07T20:26:11.8894344Z   cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.8.55-hbd13f7d_0
2025-05-07T20:26:11.8894893Z   cuda-toolkit       conda-forge/noarch::cuda-toolkit-12.8.0-ha804496_0
2025-05-07T20:26:11.8895377Z   cuda-tools         conda-forge/linux-64::cuda-tools-12.8.0-ha770c72_0
2025-05-07T20:26:11.8895858Z   cuda-version       conda-forge/noarch::cuda-version-12.8-h5d125a7_3
2025-05-07T20:26:11.8896394Z   cuda-visual-tools  conda-forge/linux-64::cuda-visual-tools-12.8.0-ha770c72_0
2025-05-07T20:26:11.8896935Z   cxx-compiler       conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0
2025-05-07T20:26:11.8897392Z   dbus               conda-forge/linux-64::dbus-1.13.6-h5008d03_3
2025-05-07T20:26:11.8898035Z   font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0
2025-05-07T20:26:11.8898660Z   font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0
2025-05-07T20:26:11.8899263Z   font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0
2025-05-07T20:26:11.8899842Z   font-ttf-ubuntu    conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3
2025-05-07T20:26:11.8900345Z   fontconfig         conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1
2025-05-07T20:26:11.8900839Z   fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0
2025-05-07T20:26:11.8901384Z   fonts-conda-forge  conda-forge/noarch::fonts-conda-forge-1-0
2025-05-07T20:26:11.8901869Z   freetype           conda-forge/linux-64::freetype-2.13.3-ha770c72_1
2025-05-07T20:26:11.8902300Z   gcc                conda-forge/linux-64::gcc-11.4.0-h602e360_13
2025-05-07T20:26:11.8902732Z   gds-tools          conda-forge/linux-64::gds-tools-1.13.0.11-h5888daf_0
2025-05-07T20:26:11.8903214Z   gmp                conda-forge/linux-64::gmp-6.3.0-hac33072_2
2025-05-07T20:26:11.8903642Z   gxx                conda-forge/linux-64::gxx-11.4.0-h602e360_13
2025-05-07T20:26:11.8904159Z   keyutils           conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0
2025-05-07T20:26:11.8904586Z   krb5               conda-forge/linux-64::krb5-1.21.3-h659f571_0
2025-05-07T20:26:11.8904998Z   libcap             conda-forge/linux-64::libcap-2.71-h39aace5_0
2025-05-07T20:26:11.8905455Z   libcublas          conda-forge/linux-64::libcublas-12.8.3.14-h9ab20c4_0
2025-05-07T20:26:11.8905978Z   libcublas-dev      conda-forge/linux-64::libcublas-dev-12.8.3.14-h9ab20c4_0
2025-05-07T20:26:11.8906559Z   libcufft           conda-forge/linux-64::libcufft-11.3.3.41-hbd13f7d_0
2025-05-07T20:26:11.8907128Z   libcufft-dev       conda-forge/linux-64::libcufft-dev-11.3.3.41-h5888daf_0
2025-05-07T20:26:11.8907714Z   libcufile          conda-forge/linux-64::libcufile-1.13.0.11-h12f29b5_0
2025-05-07T20:26:11.8908295Z   libcufile-dev      conda-forge/linux-64::libcufile-dev-1.13.0.11-h5888daf_0
2025-05-07T20:26:11.8908897Z   libcurand          conda-forge/linux-64::libcurand-10.3.9.55-hbd13f7d_0
2025-05-07T20:26:11.8909417Z   libcurand-dev      conda-forge/linux-64::libcurand-dev-10.3.9.55-h5888daf_0
2025-05-07T20:26:11.8909947Z   libcusolver        conda-forge/linux-64::libcusolver-11.7.2.55-h9ab20c4_0
2025-05-07T20:26:11.8910487Z   libcusolver-dev    conda-forge/linux-64::libcusolver-dev-11.7.2.55-h9ab20c4_0
2025-05-07T20:26:11.8911034Z   libcusparse        conda-forge/linux-64::libcusparse-12.5.7.53-hbd13f7d_0
2025-05-07T20:26:11.8911624Z   libcusparse-dev    conda-forge/linux-64::libcusparse-dev-12.5.7.53-h5888daf_0
2025-05-07T20:26:11.8912151Z   libedit            conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2
2025-05-07T20:26:11.8912634Z   libfreetype        conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1
2025-05-07T20:26:11.8913145Z   libfreetype6       conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1
2025-05-07T20:26:11.8913668Z   libgcrypt-lib      conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2
2025-05-07T20:26:11.8914155Z   libglib            conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0
2025-05-07T20:26:11.8914602Z   libglvnd           conda-forge/linux-64::libglvnd-1.7.0-ha4b6fd6_2
2025-05-07T20:26:11.8915076Z   libgpg-error       conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0
2025-05-07T20:26:11.8915653Z   libiconv           conda-forge/linux-64::libiconv-1.18-h4ce23a2_1
2025-05-07T20:26:11.8916090Z   libnl              conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0
2025-05-07T20:26:11.8916519Z   libnpp             conda-forge/linux-64::libnpp-12.3.3.65-hbd13f7d_0
2025-05-07T20:26:11.8916994Z   libnpp-dev         conda-forge/linux-64::libnpp-dev-12.3.3.65-h5888daf_0
2025-05-07T20:26:11.8917468Z   libnuma            conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2
2025-05-07T20:26:11.8917935Z   libnvfatbin        conda-forge/linux-64::libnvfatbin-12.8.55-hbd13f7d_0
2025-05-07T20:26:11.8918589Z   libnvfatbin-dev    conda-forge/linux-64::libnvfatbin-dev-12.8.55-h5888daf_0
2025-05-07T20:26:11.8919137Z   libnvjitlink       conda-forge/linux-64::libnvjitlink-12.8.61-hbd13f7d_0
2025-05-07T20:26:11.8919691Z   libnvjitlink-dev   conda-forge/linux-64::libnvjitlink-dev-12.8.61-h5888daf_0
2025-05-07T20:26:11.8920221Z   libnvjpeg          conda-forge/linux-64::libnvjpeg-12.3.5.57-h97fd463_0
2025-05-07T20:26:11.8920738Z   libnvjpeg-dev      conda-forge/linux-64::libnvjpeg-dev-12.3.5.57-ha770c72_0
2025-05-07T20:26:11.8921296Z   libopengl          conda-forge/linux-64::libopengl-1.7.0-ha4b6fd6_2
2025-05-07T20:26:11.8921747Z   libpng             conda-forge/linux-64::libpng-1.6.47-h943b412_0
2025-05-07T20:26:11.8922199Z   libsystemd0        conda-forge/linux-64::libsystemd0-256.9-h2774228_0
2025-05-07T20:26:11.8922669Z   libudev1           conda-forge/linux-64::libudev1-257.4-h9a4d06a_0
2025-05-07T20:26:11.8923107Z   libxcb             conda-forge/linux-64::libxcb-1.17.0-h8a09558_0
2025-05-07T20:26:11.8923584Z   libxkbcommon       conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0
2025-05-07T20:26:11.8924073Z   libxkbfile         conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1
2025-05-07T20:26:11.8924638Z   libxml2            conda-forge/linux-64::libxml2-2.13.5-h064dc61_0
2025-05-07T20:26:11.8925067Z   lz4-c              conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0
2025-05-07T20:26:11.8925565Z   nsight-compute     conda-forge/linux-64::nsight-compute-2025.1.0.14-hb5ebaad_0
2025-05-07T20:26:11.8926055Z   nspr               conda-forge/linux-64::nspr-4.36-h5888daf_0
2025-05-07T20:26:11.8926446Z   nss                conda-forge/linux-64::nss-3.111-h159eef7_0
2025-05-07T20:26:11.8926861Z   ocl-icd            conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0
2025-05-07T20:26:11.8927359Z   opencl-headers     conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0
2025-05-07T20:26:11.8927860Z   pcre2              conda-forge/linux-64::pcre2-10.44-hc749103_2
2025-05-07T20:26:11.8928345Z   pthread-stubs      conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002
2025-05-07T20:26:11.8928849Z   rdma-core          conda-forge/linux-64::rdma-core-55.0-h5888daf_0
2025-05-07T20:26:11.8929300Z   wayland            conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0
2025-05-07T20:26:11.8929744Z   xcb-util           conda-forge/linux-64::xcb-util-0.4.1-hb711507_2
2025-05-07T20:26:11.8930242Z   xcb-util-cursor    conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0
2025-05-07T20:26:11.8930786Z   xcb-util-image     conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2
2025-05-07T20:26:11.8931351Z   xcb-util-keysyms   conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0
2025-05-07T20:26:11.8931969Z   xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0
2025-05-07T20:26:11.8932509Z   xcb-util-wm        conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0
2025-05-07T20:26:11.8933029Z   xkeyboard-config   conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0
2025-05-07T20:26:11.8933560Z   xorg-libice        conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0
2025-05-07T20:26:11.8934038Z   xorg-libsm         conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0
2025-05-07T20:26:11.8934628Z   xorg-libx11        conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0
2025-05-07T20:26:11.8935117Z   xorg-libxau        conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0
2025-05-07T20:26:11.8935659Z   xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2
2025-05-07T20:26:11.8936236Z   xorg-libxdamage    conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0
2025-05-07T20:26:11.8936771Z   xorg-libxdmcp      conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0
2025-05-07T20:26:11.8937278Z   xorg-libxext       conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0
2025-05-07T20:26:11.8937793Z   xorg-libxfixes     conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0
2025-05-07T20:26:11.8938298Z   xorg-libxi         conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0
2025-05-07T20:26:11.8938940Z   xorg-libxrandr     conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0
2025-05-07T20:26:11.8939487Z   xorg-libxrender    conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0
2025-05-07T20:26:11.8940024Z   xorg-libxtst       conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3
2025-05-07T20:26:11.8940478Z   zstd               conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2
2025-05-07T20:26:11.8940726Z 
2025-05-07T20:26:11.8940848Z The following packages will be UPDATED:
2025-05-07T20:26:11.8941055Z 
2025-05-07T20:26:11.8941220Z   libsqlite          3.46.0-hde9e2c9_0 --> 3.49.2-hee588c1_0
2025-05-07T20:26:11.8941760Z   libzlib            1.2.13-h4ab18f5_6 --> 1.3.1-hb9d3cd8_2
2025-05-07T20:26:11.8942220Z   zlib               1.2.13-h4ab18f5_6 --> 1.3.1-hb9d3cd8_2
2025-05-07T20:26:11.8942462Z 
2025-05-07T20:26:11.8942688Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:26:11.8943002Z 
2025-05-07T20:26:11.8943271Z   sqlite             pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
2025-05-07T20:26:11.8943960Z   tk                 pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
2025-05-07T20:26:11.8944291Z 
2025-05-07T20:26:11.8944478Z Downloading and Extracting Packages: ...working...
2025-05-07T20:26:11.8944863Z libcublas-12.8.3.14  | 460.2 MB |   |  0%
2025-05-07T20:26:11.8945516Z nsight-compute-2025. | 320.6 MB |   |  0%
2025-05-07T20:26:11.8945995Z libcusparse-12.5.7.5 | 164.9 MB |   |  0%
2025-05-07T20:26:11.8946494Z libcusolver-11.7.2.5 | 156.9 MB |   |  0%
2025-05-07T20:26:11.8947010Z libcufft-11.3.3.41   | 147.4 MB |   |  0%
2025-05-07T20:26:11.8947527Z libnpp-12.3.3.65     | 130.6 MB |   |  0%
2025-05-07T20:26:11.8948063Z cuda-nsight-12.8.55  | 113.2 MB |   |  0%
2025-05-07T20:26:11.8964039Z cuda-nvvp-12.8.57    | 112.4 MB |   |  0%
2025-05-07T20:26:11.8967921Z cuda-nvrtc-12.8.61   | 63.1 MB  |   |  0%
2025-05-07T20:26:11.8969233Z libcurand-10.3.9.55  | 43.6 MB  |   |  0%
2025-05-07T20:26:11.8970710Z gds-tools-1.13.0.11  | 37.9 MB  |   |  0%
2025-05-07T20:26:11.8971819Z libnvjitlink-12.8.61 | 28.7 MB  |   |  0%
2025-05-07T20:26:11.8972737Z cuda-nvcc-tools-12.8 | 24.5 MB  |   |  0%
2025-05-07T20:26:11.8974022Z cuda-nvvm-tools-12.8 | 23.5 MB  |   |  0%
2025-05-07T20:26:11.8975168Z cuda-nvvm-impl-12.8. | 20.8 MB  |   |  0%
2025-05-07T20:26:11.8976413Z cuda-nvcc-dev_linux- | 12.7 MB  |   |  0%
2025-05-07T20:26:11.8978162Z cuda-sanitizer-api-1 | 8.8 MB   |   |  0%
2025-05-07T20:26:11.8980459Z cuda-nvdisasm-12.8.5 | 4.9 MB   |   |  0%
2025-05-07T20:26:11.8981533Z cuda-cupti-dev-12.8. | 4.0 MB   |   |  0%
2025-05-07T20:26:11.9886802Z ... (more hidden) ...
[interleaved progress-bar redraw output elided: between 2025-05-07T20:26:11.99 and 20:26:20.66, libcufft-11.3.3.41, libcusolver-11.7.2.5, libcusparse-12.5.7.5, and nsight-compute-2025. each reach 100%; the log section is cut off mid-download at 20:26:21.31 with libcublas-12.8.3.14 at 73%, libnpp-12.3.3.65 at 48%, cuda-nvvp-12.8.57 at 35%, and cuda-nsight-12.8.55 at 33%]
cuda-nsight-12.8.55 | 113.2 MB | ###4 | 35%  2025-05-07T20:26:21.3195463Z libcublas-12.8.3.14 | 460.2 MB | #######3 | 74% 2025-05-07T20:26:21.3195837Z 2025-05-07T20:26:21.3195843Z 2025-05-07T20:26:21.3195849Z 2025-05-07T20:26:21.3195854Z 2025-05-07T20:26:21.3195860Z 2025-05-07T20:26:21.3195865Z 2025-05-07T20:26:21.3195880Z 2025-05-07T20:26:21.3344233Z cuda-nvvp-12.8.57 | 112.4 MB | ###7 | 37%  2025-05-07T20:26:21.3344650Z 2025-05-07T20:26:21.3344656Z 2025-05-07T20:26:21.3344661Z 2025-05-07T20:26:21.3344682Z 2025-05-07T20:26:21.3346039Z 2025-05-07T20:26:21.4084428Z libnpp-12.3.3.65 | 130.6 MB | ####9 | 50%  2025-05-07T20:26:21.4084779Z 2025-05-07T20:26:21.4084785Z 2025-05-07T20:26:21.4084790Z 2025-05-07T20:26:21.4084795Z 2025-05-07T20:26:21.4084800Z 2025-05-07T20:26:21.4084806Z 2025-05-07T20:26:21.4116066Z cuda-nsight-12.8.55 | 113.2 MB | ###6 | 37%  2025-05-07T20:26:21.4197483Z libcublas-12.8.3.14 | 460.2 MB | #######4 | 74% 2025-05-07T20:26:21.4197805Z 2025-05-07T20:26:21.4197810Z 2025-05-07T20:26:21.4197813Z 2025-05-07T20:26:21.4197817Z 2025-05-07T20:26:21.4197821Z 2025-05-07T20:26:21.4197825Z 2025-05-07T20:26:21.4199914Z 2025-05-07T20:26:21.4409662Z cuda-nvvp-12.8.57 | 112.4 MB | ###9 | 39%  2025-05-07T20:26:21.4410065Z 2025-05-07T20:26:21.4410070Z 2025-05-07T20:26:21.4410094Z 2025-05-07T20:26:21.4410098Z 2025-05-07T20:26:21.4414493Z 2025-05-07T20:26:21.5094291Z libnpp-12.3.3.65 | 130.6 MB | #####1 | 52%  2025-05-07T20:26:21.5094868Z 2025-05-07T20:26:21.5094872Z 2025-05-07T20:26:21.5094876Z 2025-05-07T20:26:21.5094880Z 2025-05-07T20:26:21.5094884Z 2025-05-07T20:26:21.5095544Z 2025-05-07T20:26:21.5170955Z cuda-nsight-12.8.55 | 113.2 MB | ###9 | 39%  2025-05-07T20:26:21.5248183Z libcublas-12.8.3.14 | 460.2 MB | #######5 | 75% 2025-05-07T20:26:21.5248540Z 2025-05-07T20:26:21.5248547Z 2025-05-07T20:26:21.5248552Z 2025-05-07T20:26:21.5248559Z 2025-05-07T20:26:21.5248565Z 2025-05-07T20:26:21.5248570Z 2025-05-07T20:26:21.5251834Z 2025-05-07T20:26:21.5412937Z cuda-nvvp-12.8.57 | 112.4 MB | ####1 | 42%  2025-05-07T20:26:21.5413229Z 2025-05-07T20:26:21.5413233Z 2025-05-07T20:26:21.5413237Z 2025-05-07T20:26:21.5413240Z 2025-05-07T20:26:21.5415439Z 2025-05-07T20:26:21.6095248Z libnpp-12.3.3.65 | 130.6 MB | #####3 | 54%  2025-05-07T20:26:21.6095549Z 2025-05-07T20:26:21.6095567Z 2025-05-07T20:26:21.6095571Z 2025-05-07T20:26:21.6095574Z 2025-05-07T20:26:21.6095578Z 2025-05-07T20:26:21.6095581Z 2025-05-07T20:26:21.6214748Z cuda-nsight-12.8.55 | 113.2 MB | ####1 | 42%  2025-05-07T20:26:21.6383712Z libcublas-12.8.3.14 | 460.2 MB | #######5 | 76% 2025-05-07T20:26:21.6384067Z 2025-05-07T20:26:21.6384074Z 2025-05-07T20:26:21.6384079Z 2025-05-07T20:26:21.6384084Z 2025-05-07T20:26:21.6384089Z 2025-05-07T20:26:21.6384094Z 2025-05-07T20:26:21.6384099Z 2025-05-07T20:26:21.6434190Z cuda-nvvp-12.8.57 | 112.4 MB | ####3 | 44%  2025-05-07T20:26:21.6434540Z 2025-05-07T20:26:21.6434545Z 2025-05-07T20:26:21.6434549Z 2025-05-07T20:26:21.6434552Z 2025-05-07T20:26:21.6434556Z 2025-05-07T20:26:21.7099632Z libnpp-12.3.3.65 | 130.6 MB | #####5 | 56%  2025-05-07T20:26:21.7099959Z 2025-05-07T20:26:21.7099964Z 2025-05-07T20:26:21.7099967Z 2025-05-07T20:26:21.7099971Z 2025-05-07T20:26:21.7099975Z 2025-05-07T20:26:21.7101602Z 2025-05-07T20:26:21.7252226Z cuda-nsight-12.8.55 | 113.2 MB | ####3 | 44%  2025-05-07T20:26:21.7411271Z libcublas-12.8.3.14 | 460.2 MB | #######6 | 76% 2025-05-07T20:26:21.7411526Z 2025-05-07T20:26:21.7411530Z 2025-05-07T20:26:21.7411534Z 2025-05-07T20:26:21.7411538Z 2025-05-07T20:26:21.7411542Z 
2025-05-07T20:26:21.7411545Z 2025-05-07T20:26:21.7411549Z 2025-05-07T20:26:21.7437058Z cuda-nvvp-12.8.57 | 112.4 MB | ####6 | 46%  2025-05-07T20:26:21.7437346Z 2025-05-07T20:26:21.7437351Z 2025-05-07T20:26:21.7437354Z 2025-05-07T20:26:21.7437358Z 2025-05-07T20:26:21.7437362Z 2025-05-07T20:26:21.8122306Z libnpp-12.3.3.65 | 130.6 MB | #####7 | 58%  2025-05-07T20:26:21.8122610Z 2025-05-07T20:26:21.8122615Z 2025-05-07T20:26:21.8122619Z 2025-05-07T20:26:21.8122862Z 2025-05-07T20:26:21.8122868Z 2025-05-07T20:26:21.8123644Z 2025-05-07T20:26:21.8350949Z cuda-nsight-12.8.55 | 113.2 MB | ####6 | 46%  2025-05-07T20:26:21.8422633Z libcublas-12.8.3.14 | 460.2 MB | #######6 | 77% 2025-05-07T20:26:21.8422890Z 2025-05-07T20:26:21.8422894Z 2025-05-07T20:26:21.8422898Z 2025-05-07T20:26:21.8422902Z 2025-05-07T20:26:21.8422905Z 2025-05-07T20:26:21.8422909Z 2025-05-07T20:26:21.8430005Z 2025-05-07T20:26:21.8538162Z cuda-nvvp-12.8.57 | 112.4 MB | ####8 | 48%  2025-05-07T20:26:21.8538462Z 2025-05-07T20:26:21.8538472Z 2025-05-07T20:26:21.8538478Z 2025-05-07T20:26:21.8538483Z 2025-05-07T20:26:21.8540106Z 2025-05-07T20:26:21.9123780Z libnpp-12.3.3.65 | 130.6 MB | #####9 | 60%  2025-05-07T20:26:21.9124190Z 2025-05-07T20:26:21.9124196Z 2025-05-07T20:26:21.9124201Z 2025-05-07T20:26:21.9124206Z 2025-05-07T20:26:21.9124212Z 2025-05-07T20:26:21.9127186Z 2025-05-07T20:26:21.9436348Z cuda-nsight-12.8.55 | 113.2 MB | ####8 | 49%  2025-05-07T20:26:21.9543275Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 77% 2025-05-07T20:26:21.9543830Z 2025-05-07T20:26:21.9543834Z 2025-05-07T20:26:21.9543838Z 2025-05-07T20:26:21.9543842Z 2025-05-07T20:26:21.9543845Z 2025-05-07T20:26:21.9546139Z libnpp-12.3.3.65 | 130.6 MB | ######1 | 62%  2025-05-07T20:26:21.9546439Z 2025-05-07T20:26:21.9546443Z 2025-05-07T20:26:21.9546446Z 2025-05-07T20:26:21.9546450Z 2025-05-07T20:26:21.9546454Z 2025-05-07T20:26:21.9546457Z 2025-05-07T20:26:21.9553725Z 2025-05-07T20:26:22.0171174Z cuda-nvvp-12.8.57 | 112.4 MB | ##### | 51%  2025-05-07T20:26:22.0171477Z 2025-05-07T20:26:22.0171482Z 2025-05-07T20:26:22.0171485Z 2025-05-07T20:26:22.0171489Z 2025-05-07T20:26:22.0171501Z 2025-05-07T20:26:22.0171505Z 2025-05-07T20:26:22.0438019Z cuda-nsight-12.8.55 | 113.2 MB | ##### | 51%  2025-05-07T20:26:22.0544611Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 78% 2025-05-07T20:26:22.0544996Z 2025-05-07T20:26:22.0545002Z 2025-05-07T20:26:22.0545024Z 2025-05-07T20:26:22.0545029Z 2025-05-07T20:26:22.0547073Z 2025-05-07T20:26:22.0555561Z libnpp-12.3.3.65 | 130.6 MB | ######3 | 64%  2025-05-07T20:26:22.0555835Z 2025-05-07T20:26:22.0555839Z 2025-05-07T20:26:22.0555843Z 2025-05-07T20:26:22.0555846Z 2025-05-07T20:26:22.0555850Z 2025-05-07T20:26:22.0555854Z 2025-05-07T20:26:22.0560823Z 2025-05-07T20:26:22.1171965Z cuda-nvvp-12.8.57 | 112.4 MB | #####2 | 53%  2025-05-07T20:26:22.1172294Z 2025-05-07T20:26:22.1172299Z 2025-05-07T20:26:22.1172305Z 2025-05-07T20:26:22.1172309Z 2025-05-07T20:26:22.1172314Z 2025-05-07T20:26:22.1172595Z 2025-05-07T20:26:22.1443274Z cuda-nsight-12.8.55 | 113.2 MB | #####3 | 53%  2025-05-07T20:26:22.1556591Z libcublas-12.8.3.14 | 460.2 MB | #######8 | 78% 2025-05-07T20:26:22.1556949Z 2025-05-07T20:26:22.1556974Z 2025-05-07T20:26:22.1556978Z 2025-05-07T20:26:22.1556982Z 2025-05-07T20:26:22.1556986Z 2025-05-07T20:26:22.1557002Z 2025-05-07T20:26:22.1557005Z 2025-05-07T20:26:22.1569013Z cuda-nvvp-12.8.57 | 112.4 MB | #####5 | 55%  2025-05-07T20:26:22.1569312Z 2025-05-07T20:26:22.1569316Z 2025-05-07T20:26:22.1569320Z 2025-05-07T20:26:22.1569323Z 
2025-05-07T20:26:22.1571540Z 2025-05-07T20:26:22.2304605Z libnpp-12.3.3.65 | 130.6 MB | ######5 | 66%  2025-05-07T20:26:22.2304903Z 2025-05-07T20:26:22.2304907Z 2025-05-07T20:26:22.2304911Z 2025-05-07T20:26:22.2304914Z 2025-05-07T20:26:22.2304925Z 2025-05-07T20:26:22.2306499Z 2025-05-07T20:26:22.2450765Z cuda-nsight-12.8.55 | 113.2 MB | #####5 | 56%  2025-05-07T20:26:22.2575372Z libcublas-12.8.3.14 | 460.2 MB | #######8 | 79% 2025-05-07T20:26:22.2575731Z 2025-05-07T20:26:22.2575736Z 2025-05-07T20:26:22.2575741Z 2025-05-07T20:26:22.2575746Z 2025-05-07T20:26:22.2576047Z 2025-05-07T20:26:22.2576061Z 2025-05-07T20:26:22.2576066Z 2025-05-07T20:26:22.2576872Z cuda-nvvp-12.8.57 | 112.4 MB | #####7 | 58%  2025-05-07T20:26:22.2577169Z 2025-05-07T20:26:22.2577173Z 2025-05-07T20:26:22.2577177Z 2025-05-07T20:26:22.2577188Z 2025-05-07T20:26:22.2577438Z 2025-05-07T20:26:22.3307193Z libnpp-12.3.3.65 | 130.6 MB | ######7 | 68%  2025-05-07T20:26:22.3307487Z 2025-05-07T20:26:22.3307491Z 2025-05-07T20:26:22.3307507Z 2025-05-07T20:26:22.3307511Z 2025-05-07T20:26:22.3307515Z 2025-05-07T20:26:22.3309520Z 2025-05-07T20:26:22.3452999Z cuda-nsight-12.8.55 | 113.2 MB | #####7 | 58%  2025-05-07T20:26:22.3581108Z libcublas-12.8.3.14 | 460.2 MB | #######9 | 80% 2025-05-07T20:26:22.3581453Z 2025-05-07T20:26:22.3581458Z 2025-05-07T20:26:22.3581464Z 2025-05-07T20:26:22.3581470Z 2025-05-07T20:26:22.3581474Z 2025-05-07T20:26:22.3593855Z libnpp-12.3.3.65 | 130.6 MB | ####### | 70%  2025-05-07T20:26:22.3594127Z 2025-05-07T20:26:22.3594131Z 2025-05-07T20:26:22.3594134Z 2025-05-07T20:26:22.3594138Z 2025-05-07T20:26:22.3594458Z 2025-05-07T20:26:22.3594463Z 2025-05-07T20:26:22.3599498Z 2025-05-07T20:26:22.4416445Z cuda-nvvp-12.8.57 | 112.4 MB | ###### | 60%  2025-05-07T20:26:22.4416746Z 2025-05-07T20:26:22.4416750Z 2025-05-07T20:26:22.4416765Z 2025-05-07T20:26:22.4416769Z 2025-05-07T20:26:22.4416772Z 2025-05-07T20:26:22.4416776Z 2025-05-07T20:26:22.4456165Z cuda-nsight-12.8.55 | 113.2 MB | ###### | 60%  2025-05-07T20:26:22.4581651Z libcublas-12.8.3.14 | 460.2 MB | ######## | 80% 2025-05-07T20:26:22.4581993Z 2025-05-07T20:26:22.4582061Z 2025-05-07T20:26:22.4582066Z 2025-05-07T20:26:22.4582073Z 2025-05-07T20:26:22.4582094Z 2025-05-07T20:26:22.4596731Z libnpp-12.3.3.65 | 130.6 MB | #######2 | 72%  2025-05-07T20:26:22.4597168Z 2025-05-07T20:26:22.4597176Z 2025-05-07T20:26:22.4597208Z 2025-05-07T20:26:22.4597214Z 2025-05-07T20:26:22.4597220Z 2025-05-07T20:26:22.4597224Z 2025-05-07T20:26:22.4597229Z 2025-05-07T20:26:22.5426818Z cuda-nvvp-12.8.57 | 112.4 MB | ######2 | 63%  2025-05-07T20:26:22.5427117Z 2025-05-07T20:26:22.5427121Z 2025-05-07T20:26:22.5427124Z 2025-05-07T20:26:22.5427128Z 2025-05-07T20:26:22.5427131Z 2025-05-07T20:26:22.5427135Z 2025-05-07T20:26:22.5456556Z cuda-nsight-12.8.55 | 113.2 MB | ######2 | 62%  2025-05-07T20:26:22.5583769Z libcublas-12.8.3.14 | 460.2 MB | ######## | 81% 2025-05-07T20:26:22.5584026Z 2025-05-07T20:26:22.5584030Z 2025-05-07T20:26:22.5584034Z 2025-05-07T20:26:22.5584038Z 2025-05-07T20:26:22.5584041Z 2025-05-07T20:26:22.5598453Z libnpp-12.3.3.65 | 130.6 MB | #######4 | 74%  2025-05-07T20:26:22.5598730Z 2025-05-07T20:26:22.5598734Z 2025-05-07T20:26:22.5598737Z 2025-05-07T20:26:22.5598741Z 2025-05-07T20:26:22.5598744Z 2025-05-07T20:26:22.5598749Z 2025-05-07T20:26:22.5602139Z 2025-05-07T20:26:22.6430659Z cuda-nvvp-12.8.57 | 112.4 MB | ######5 | 65%  2025-05-07T20:26:22.6430986Z 2025-05-07T20:26:22.6430990Z 2025-05-07T20:26:22.6430994Z 2025-05-07T20:26:22.6430998Z 2025-05-07T20:26:22.6431001Z 
2025-05-07T20:26:22.6433412Z 2025-05-07T20:26:22.6599359Z cuda-nsight-12.8.55 | 113.2 MB | ######4 | 64%  2025-05-07T20:26:22.6657803Z libcublas-12.8.3.14 | 460.2 MB | ########1 | 81% 2025-05-07T20:26:22.6658064Z 2025-05-07T20:26:22.6658068Z 2025-05-07T20:26:22.6658071Z 2025-05-07T20:26:22.6658075Z 2025-05-07T20:26:22.6658078Z 2025-05-07T20:26:22.6658082Z 2025-05-07T20:26:22.6663472Z 2025-05-07T20:26:22.6669060Z cuda-nvvp-12.8.57 | 112.4 MB | ######7 | 68%  2025-05-07T20:26:22.6669351Z 2025-05-07T20:26:22.6669355Z 2025-05-07T20:26:22.6669359Z 2025-05-07T20:26:22.6669362Z 2025-05-07T20:26:22.6672948Z 2025-05-07T20:26:22.7451981Z libnpp-12.3.3.65 | 130.6 MB | #######6 | 77%  2025-05-07T20:26:22.7452279Z 2025-05-07T20:26:22.7452282Z 2025-05-07T20:26:22.7452286Z 2025-05-07T20:26:22.7452300Z 2025-05-07T20:26:22.7452304Z 2025-05-07T20:26:22.7456687Z 2025-05-07T20:26:22.7687428Z cuda-nsight-12.8.55 | 113.2 MB | ######6 | 67%  2025-05-07T20:26:22.7687844Z 2025-05-07T20:26:22.7687850Z 2025-05-07T20:26:22.7687856Z 2025-05-07T20:26:22.7687861Z 2025-05-07T20:26:22.7689816Z 2025-05-07T20:26:22.7791192Z libnpp-12.3.3.65 | 130.6 MB | #######8 | 79%  2025-05-07T20:26:22.7807027Z libcublas-12.8.3.14 | 460.2 MB | ########1 | 82% 2025-05-07T20:26:22.7807394Z 2025-05-07T20:26:22.7807400Z 2025-05-07T20:26:22.7807406Z 2025-05-07T20:26:22.7807410Z 2025-05-07T20:26:22.7807415Z 2025-05-07T20:26:22.7807420Z 2025-05-07T20:26:22.7807426Z 2025-05-07T20:26:22.8528097Z cuda-nvvp-12.8.57 | 112.4 MB | ####### | 70%  2025-05-07T20:26:22.8528419Z 2025-05-07T20:26:22.8528423Z 2025-05-07T20:26:22.8528449Z 2025-05-07T20:26:22.8528453Z 2025-05-07T20:26:22.8528456Z 2025-05-07T20:26:22.8530378Z 2025-05-07T20:26:22.8692358Z cuda-nsight-12.8.55 | 113.2 MB | ######8 | 69%  2025-05-07T20:26:22.8692654Z 2025-05-07T20:26:22.8692658Z 2025-05-07T20:26:22.8692662Z 2025-05-07T20:26:22.8692665Z 2025-05-07T20:26:22.8692669Z 2025-05-07T20:26:22.8794997Z libnpp-12.3.3.65 | 130.6 MB | ######## | 81%  2025-05-07T20:26:22.8902097Z libcublas-12.8.3.14 | 460.2 MB | ########2 | 82% 2025-05-07T20:26:22.8902390Z 2025-05-07T20:26:22.8902394Z 2025-05-07T20:26:22.8902397Z 2025-05-07T20:26:22.8902401Z 2025-05-07T20:26:22.8902404Z 2025-05-07T20:26:22.8902408Z 2025-05-07T20:26:22.8902412Z 2025-05-07T20:26:22.9528912Z cuda-nvvp-12.8.57 | 112.4 MB | #######2 | 72%  2025-05-07T20:26:22.9529230Z 2025-05-07T20:26:22.9529234Z 2025-05-07T20:26:22.9529237Z 2025-05-07T20:26:22.9529241Z 2025-05-07T20:26:22.9529245Z 2025-05-07T20:26:22.9530625Z 2025-05-07T20:26:22.9736194Z cuda-nsight-12.8.55 | 113.2 MB | #######1 | 71%  2025-05-07T20:26:22.9736506Z 2025-05-07T20:26:22.9736510Z 2025-05-07T20:26:22.9736513Z 2025-05-07T20:26:22.9736517Z 2025-05-07T20:26:22.9742101Z 2025-05-07T20:26:22.9886092Z libnpp-12.3.3.65 | 130.6 MB | ########2 | 83%  2025-05-07T20:26:22.9978843Z libcublas-12.8.3.14 | 460.2 MB | ########2 | 83% 2025-05-07T20:26:22.9979234Z 2025-05-07T20:26:22.9979241Z 2025-05-07T20:26:22.9979246Z 2025-05-07T20:26:22.9979251Z 2025-05-07T20:26:22.9979257Z 2025-05-07T20:26:22.9979262Z 2025-05-07T20:26:22.9979436Z 2025-05-07T20:26:23.0534005Z cuda-nvvp-12.8.57 | 112.4 MB | #######4 | 75%  2025-05-07T20:26:23.0534297Z 2025-05-07T20:26:23.0534301Z 2025-05-07T20:26:23.0534305Z 2025-05-07T20:26:23.0534308Z 2025-05-07T20:26:23.0534313Z 2025-05-07T20:26:23.0534521Z 2025-05-07T20:26:23.0757453Z cuda-nsight-12.8.55 | 113.2 MB | #######3 | 73%  2025-05-07T20:26:23.0757752Z 2025-05-07T20:26:23.0757756Z 2025-05-07T20:26:23.0757759Z 2025-05-07T20:26:23.0757763Z 
2025-05-07T20:26:23.0767314Z 2025-05-07T20:26:23.0920169Z libnpp-12.3.3.65 | 130.6 MB | ########4 | 85%  2025-05-07T20:26:23.0988020Z libcublas-12.8.3.14 | 460.2 MB | ########3 | 83% 2025-05-07T20:26:23.0988297Z 2025-05-07T20:26:23.0988302Z 2025-05-07T20:26:23.0988305Z 2025-05-07T20:26:23.0988309Z 2025-05-07T20:26:23.0988313Z 2025-05-07T20:26:23.0988316Z 2025-05-07T20:26:23.0988320Z 2025-05-07T20:26:23.1542352Z cuda-nvvp-12.8.57 | 112.4 MB | #######7 | 77%  2025-05-07T20:26:23.1542658Z 2025-05-07T20:26:23.1542662Z 2025-05-07T20:26:23.1542666Z 2025-05-07T20:26:23.1542670Z 2025-05-07T20:26:23.1542674Z 2025-05-07T20:26:23.1546021Z 2025-05-07T20:26:23.1961103Z cuda-nsight-12.8.55 | 113.2 MB | #######5 | 76%  2025-05-07T20:26:23.1961412Z 2025-05-07T20:26:23.1961416Z 2025-05-07T20:26:23.1961655Z 2025-05-07T20:26:23.1961660Z 2025-05-07T20:26:23.1963535Z 2025-05-07T20:26:23.1989700Z libnpp-12.3.3.65 | 130.6 MB | ########6 | 87%  2025-05-07T20:26:23.1990000Z 2025-05-07T20:26:23.1990004Z 2025-05-07T20:26:23.1990007Z 2025-05-07T20:26:23.1990011Z 2025-05-07T20:26:23.1990014Z 2025-05-07T20:26:23.1990018Z 2025-05-07T20:26:23.1998764Z 2025-05-07T20:26:23.2001693Z cuda-nvvp-12.8.57 | 112.4 MB | #######9 | 80%  2025-05-07T20:26:23.2585281Z libcublas-12.8.3.14 | 460.2 MB | ########3 | 84% 2025-05-07T20:26:23.2585610Z 2025-05-07T20:26:23.2585642Z 2025-05-07T20:26:23.2585648Z 2025-05-07T20:26:23.2585653Z 2025-05-07T20:26:23.2585658Z 2025-05-07T20:26:23.2585882Z 2025-05-07T20:26:23.3007278Z cuda-nsight-12.8.55 | 113.2 MB | #######7 | 78%  2025-05-07T20:26:23.3036755Z libcublas-12.8.3.14 | 460.2 MB | ########4 | 84% 2025-05-07T20:26:23.3037137Z 2025-05-07T20:26:23.3037145Z 2025-05-07T20:26:23.3037151Z 2025-05-07T20:26:23.3037184Z 2025-05-07T20:26:23.3037190Z 2025-05-07T20:26:23.3037196Z 2025-05-07T20:26:23.3037231Z 2025-05-07T20:26:23.3068743Z cuda-nvvp-12.8.57 | 112.4 MB | ########1 | 82%  2025-05-07T20:26:23.3069135Z 2025-05-07T20:26:23.3069140Z 2025-05-07T20:26:23.3069143Z 2025-05-07T20:26:23.3069147Z 2025-05-07T20:26:23.3069367Z 2025-05-07T20:26:23.3742571Z libnpp-12.3.3.65 | 130.6 MB | ########8 | 89%  2025-05-07T20:26:23.3742858Z 2025-05-07T20:26:23.3742862Z 2025-05-07T20:26:23.3742866Z 2025-05-07T20:26:23.3742869Z 2025-05-07T20:26:23.3742873Z 2025-05-07T20:26:23.3742880Z 2025-05-07T20:26:23.4008778Z cuda-nsight-12.8.55 | 113.2 MB | ######## | 80%  2025-05-07T20:26:23.4040329Z libcublas-12.8.3.14 | 460.2 MB | ########5 | 85% 2025-05-07T20:26:23.4040604Z 2025-05-07T20:26:23.4040608Z 2025-05-07T20:26:23.4040612Z 2025-05-07T20:26:23.4040616Z 2025-05-07T20:26:23.4040620Z 2025-05-07T20:26:23.4040624Z 2025-05-07T20:26:23.4040652Z 2025-05-07T20:26:23.4097320Z cuda-nvvp-12.8.57 | 112.4 MB | ########4 | 84%  2025-05-07T20:26:23.4097778Z 2025-05-07T20:26:23.4097784Z 2025-05-07T20:26:23.4097789Z 2025-05-07T20:26:23.4097794Z 2025-05-07T20:26:23.4102087Z 2025-05-07T20:26:23.4847649Z libnpp-12.3.3.65 | 130.6 MB | ######### | 91%  2025-05-07T20:26:23.4847990Z 2025-05-07T20:26:23.4847996Z 2025-05-07T20:26:23.4848001Z 2025-05-07T20:26:23.4848006Z 2025-05-07T20:26:23.4848011Z 2025-05-07T20:26:23.4848016Z 2025-05-07T20:26:23.5018892Z cuda-nsight-12.8.55 | 113.2 MB | ########2 | 82%  2025-05-07T20:26:23.5044920Z libcublas-12.8.3.14 | 460.2 MB | ########5 | 86% 2025-05-07T20:26:23.5045166Z 2025-05-07T20:26:23.5045173Z 2025-05-07T20:26:23.5045520Z 2025-05-07T20:26:23.5045581Z 2025-05-07T20:26:23.5045585Z 2025-05-07T20:26:23.5045589Z 2025-05-07T20:26:23.5045734Z 2025-05-07T20:26:23.5101872Z cuda-nvvp-12.8.57 | 112.4 MB | 
########6 | 87%  2025-05-07T20:26:23.5102171Z 2025-05-07T20:26:23.5102175Z 2025-05-07T20:26:23.5102179Z 2025-05-07T20:26:23.5102193Z 2025-05-07T20:26:23.5107769Z 2025-05-07T20:26:23.5914163Z libnpp-12.3.3.65 | 130.6 MB | #########2 | 92%  2025-05-07T20:26:23.5914448Z 2025-05-07T20:26:23.5914452Z 2025-05-07T20:26:23.5914456Z 2025-05-07T20:26:23.5914459Z 2025-05-07T20:26:23.5914463Z 2025-05-07T20:26:23.5917774Z 2025-05-07T20:26:23.6020888Z cuda-nsight-12.8.55 | 113.2 MB | ########4 | 84%  2025-05-07T20:26:23.6077655Z libcublas-12.8.3.14 | 460.2 MB | ########6 | 86% 2025-05-07T20:26:23.6077897Z 2025-05-07T20:26:23.6077900Z 2025-05-07T20:26:23.6077904Z 2025-05-07T20:26:23.6077908Z 2025-05-07T20:26:23.6077919Z 2025-05-07T20:26:23.6077923Z 2025-05-07T20:26:23.6080367Z 2025-05-07T20:26:23.6197699Z cuda-nvvp-12.8.57 | 112.4 MB | ########9 | 89%  2025-05-07T20:26:23.6197973Z 2025-05-07T20:26:23.6197983Z 2025-05-07T20:26:23.6198214Z 2025-05-07T20:26:23.6198219Z 2025-05-07T20:26:23.6200061Z 2025-05-07T20:26:23.6920084Z libnpp-12.3.3.65 | 130.6 MB | #########4 | 94%  2025-05-07T20:26:23.6920372Z 2025-05-07T20:26:23.6920376Z 2025-05-07T20:26:23.6920380Z 2025-05-07T20:26:23.6920383Z 2025-05-07T20:26:23.6920387Z 2025-05-07T20:26:23.6920390Z 2025-05-07T20:26:23.7023199Z cuda-nsight-12.8.55 | 113.2 MB | ########6 | 86%  2025-05-07T20:26:23.7099050Z libcublas-12.8.3.14 | 460.2 MB | ########6 | 87% 2025-05-07T20:26:23.7099293Z 2025-05-07T20:26:23.7099297Z 2025-05-07T20:26:23.7099301Z 2025-05-07T20:26:23.7099304Z 2025-05-07T20:26:23.7099315Z 2025-05-07T20:26:23.7099319Z 2025-05-07T20:26:23.7099323Z 2025-05-07T20:26:23.7282718Z cuda-nvvp-12.8.57 | 112.4 MB | #########1 | 92%  2025-05-07T20:26:23.7282991Z 2025-05-07T20:26:23.7282994Z 2025-05-07T20:26:23.7283005Z 2025-05-07T20:26:23.7283009Z 2025-05-07T20:26:23.7285491Z 2025-05-07T20:26:23.7922062Z libnpp-12.3.3.65 | 130.6 MB | #########6 | 96%  2025-05-07T20:26:23.7922330Z 2025-05-07T20:26:23.7922341Z 2025-05-07T20:26:23.7922873Z 2025-05-07T20:26:23.7922877Z 2025-05-07T20:26:23.7922881Z 2025-05-07T20:26:23.7924669Z 2025-05-07T20:26:23.8037588Z cuda-nsight-12.8.55 | 113.2 MB | ########8 | 89%  2025-05-07T20:26:23.8182824Z libcublas-12.8.3.14 | 460.2 MB | ########7 | 87% 2025-05-07T20:26:23.8183067Z 2025-05-07T20:26:23.8183071Z 2025-05-07T20:26:23.8183074Z 2025-05-07T20:26:23.8183078Z 2025-05-07T20:26:23.8183082Z 2025-05-07T20:26:23.8183092Z 2025-05-07T20:26:23.8183096Z 2025-05-07T20:26:23.8287359Z cuda-nvvp-12.8.57 | 112.4 MB | #########3 | 94%  2025-05-07T20:26:23.8287633Z 2025-05-07T20:26:23.8287636Z 2025-05-07T20:26:23.8287640Z 2025-05-07T20:26:23.8287650Z 2025-05-07T20:26:23.8290877Z 2025-05-07T20:26:23.8926326Z libnpp-12.3.3.65 | 130.6 MB | #########7 | 98%  2025-05-07T20:26:23.8926609Z 2025-05-07T20:26:23.8926613Z 2025-05-07T20:26:23.8926630Z 2025-05-07T20:26:23.8926633Z 2025-05-07T20:26:23.8926637Z 2025-05-07T20:26:23.8926648Z 2025-05-07T20:26:23.9109308Z cuda-nsight-12.8.55 | 113.2 MB | ######### | 91%  2025-05-07T20:26:23.9193618Z libcublas-12.8.3.14 | 460.2 MB | ########7 | 88% 2025-05-07T20:26:23.9193897Z 2025-05-07T20:26:23.9193912Z 2025-05-07T20:26:23.9193917Z 2025-05-07T20:26:23.9193922Z 2025-05-07T20:26:23.9193927Z 2025-05-07T20:26:23.9193932Z 2025-05-07T20:26:23.9193937Z 2025-05-07T20:26:23.9931608Z cuda-nvvp-12.8.57 | 112.4 MB | #########6 | 96%  2025-05-07T20:26:23.9932007Z 2025-05-07T20:26:23.9932013Z 2025-05-07T20:26:23.9932019Z 2025-05-07T20:26:23.9932024Z 2025-05-07T20:26:23.9932029Z 2025-05-07T20:26:23.9933583Z 2025-05-07T20:26:24.0114398Z 
cuda-nsight-12.8.55 | 113.2 MB | #########2 | 93%  2025-05-07T20:26:24.0193706Z libcublas-12.8.3.14 | 460.2 MB | ########8 | 89% 2025-05-07T20:26:24.0194080Z 2025-05-07T20:26:24.0194087Z 2025-05-07T20:26:24.0194092Z 2025-05-07T20:26:24.0194097Z 2025-05-07T20:26:24.0194114Z 2025-05-07T20:26:24.0194119Z 2025-05-07T20:26:24.0194124Z 2025-05-07T20:26:24.0936272Z cuda-nvvp-12.8.57 | 112.4 MB | #########8 | 99%  2025-05-07T20:26:24.0936656Z 2025-05-07T20:26:24.0936662Z 2025-05-07T20:26:24.0936667Z 2025-05-07T20:26:24.0936672Z 2025-05-07T20:26:24.0936678Z 2025-05-07T20:26:24.0938084Z 2025-05-07T20:26:24.1119793Z cuda-nsight-12.8.55 | 113.2 MB | #########5 | 95%  2025-05-07T20:26:24.1937174Z libcublas-12.8.3.14 | 460.2 MB | ########9 | 89% 2025-05-07T20:26:24.1937536Z 2025-05-07T20:26:24.1937542Z 2025-05-07T20:26:24.1937547Z 2025-05-07T20:26:24.1937552Z 2025-05-07T20:26:24.1937557Z 2025-05-07T20:26:24.1939432Z 2025-05-07T20:26:24.2351838Z cuda-nsight-12.8.55 | 113.2 MB | #########7 | 98%  2025-05-07T20:26:24.3710402Z libcublas-12.8.3.14 | 460.2 MB | ########9 | 90% 2025-05-07T20:26:24.4717210Z libcublas-12.8.3.14 | 460.2 MB | ######### | 90% 2025-05-07T20:26:24.5718006Z libcublas-12.8.3.14 | 460.2 MB | ######### | 91% 2025-05-07T20:26:24.6726127Z libcublas-12.8.3.14 | 460.2 MB | #########1 | 92% 2025-05-07T20:26:24.7737572Z libcublas-12.8.3.14 | 460.2 MB | #########2 | 92% 2025-05-07T20:26:24.8738874Z libcublas-12.8.3.14 | 460.2 MB | #########3 | 93% 2025-05-07T20:26:24.9742628Z libcublas-12.8.3.14 | 460.2 MB | #########3 | 94% 2025-05-07T20:26:25.0743635Z libcublas-12.8.3.14 | 460.2 MB | #########4 | 95% 2025-05-07T20:26:25.1744178Z libcublas-12.8.3.14 | 460.2 MB | #########5 | 96% 2025-05-07T20:26:25.2745405Z libcublas-12.8.3.14 | 460.2 MB | #########6 | 97% 2025-05-07T20:26:25.3823529Z libcublas-12.8.3.14 | 460.2 MB | #########7 | 98% 2025-05-07T20:26:25.4913189Z libcublas-12.8.3.14 | 460.2 MB | #########8 | 98% 2025-05-07T20:26:27.4230448Z libcublas-12.8.3.14 | 460.2 MB | #########9 | 99% 2025-05-07T20:26:27.4230754Z 2025-05-07T20:26:27.4230761Z 2025-05-07T20:26:27.4230766Z 2025-05-07T20:26:27.4231920Z 2025-05-07T20:26:27.7831667Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%  2025-05-07T20:26:27.7832232Z 2025-05-07T20:26:27.7832236Z 2025-05-07T20:26:27.7832239Z 2025-05-07T20:26:27.7832243Z 2025-05-07T20:26:27.7832247Z 2025-05-07T20:26:27.7832251Z 2025-05-07T20:26:27.7836213Z 2025-05-07T20:26:27.8232809Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%  2025-05-07T20:26:27.8233208Z 2025-05-07T20:26:27.8233214Z 2025-05-07T20:26:27.8233219Z 2025-05-07T20:26:27.8233224Z 2025-05-07T20:26:27.8233229Z 2025-05-07T20:26:27.8233234Z 2025-05-07T20:26:27.8233239Z 2025-05-07T20:26:27.8233245Z 2025-05-07T20:26:27.9236210Z cuda-nvrtc-12.8.61 | 63.1 MB | | 0%  2025-05-07T20:26:27.9236598Z 2025-05-07T20:26:27.9236604Z 2025-05-07T20:26:27.9236609Z 2025-05-07T20:26:27.9236614Z 2025-05-07T20:26:27.9236620Z 2025-05-07T20:26:27.9236642Z 2025-05-07T20:26:27.9236648Z 2025-05-07T20:26:27.9236653Z 2025-05-07T20:26:28.0227938Z cuda-nvrtc-12.8.61 | 63.1 MB | 4 | 5%  2025-05-07T20:26:28.0228354Z 2025-05-07T20:26:28.0228360Z 2025-05-07T20:26:28.0228365Z 2025-05-07T20:26:28.0228371Z 2025-05-07T20:26:28.0228376Z 2025-05-07T20:26:28.0229734Z 2025-05-07T20:26:28.0239868Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%  2025-05-07T20:26:28.0240276Z 2025-05-07T20:26:28.0240282Z 2025-05-07T20:26:28.0240288Z 2025-05-07T20:26:28.0240300Z 2025-05-07T20:26:28.0240306Z 2025-05-07T20:26:28.0240311Z 2025-05-07T20:26:28.0240316Z 
2025-05-07T20:26:28.0240321Z 2025-05-07T20:26:28.0914156Z cuda-nvrtc-12.8.61 | 63.1 MB | 9 | 10%  2025-05-07T20:26:28.0914576Z 2025-05-07T20:26:28.0914582Z 2025-05-07T20:26:28.0914587Z 2025-05-07T20:26:28.0914592Z 2025-05-07T20:26:28.0914597Z 2025-05-07T20:26:28.0914603Z 2025-05-07T20:26:28.0914624Z 2025-05-07T20:26:28.0914629Z 2025-05-07T20:26:28.0916393Z 2025-05-07T20:26:28.1240118Z libcurand-10.3.9.55 | 43.6 MB | | 0%  2025-05-07T20:26:28.1240546Z 2025-05-07T20:26:28.1240552Z 2025-05-07T20:26:28.1240557Z 2025-05-07T20:26:28.1240562Z 2025-05-07T20:26:28.1240568Z 2025-05-07T20:26:28.1240573Z 2025-05-07T20:26:28.1240578Z 2025-05-07T20:26:28.1240583Z 2025-05-07T20:26:28.1917117Z cuda-nvrtc-12.8.61 | 63.1 MB | #4 | 15%  2025-05-07T20:26:28.1917505Z 2025-05-07T20:26:28.1917510Z 2025-05-07T20:26:28.1917515Z 2025-05-07T20:26:28.1917520Z 2025-05-07T20:26:28.1917525Z 2025-05-07T20:26:28.1917531Z 2025-05-07T20:26:28.1917535Z 2025-05-07T20:26:28.1917552Z 2025-05-07T20:26:28.1917558Z 2025-05-07T20:26:28.2244971Z libcurand-10.3.9.55 | 43.6 MB | 7 | 7%  2025-05-07T20:26:28.2245360Z 2025-05-07T20:26:28.2245366Z 2025-05-07T20:26:28.2245371Z 2025-05-07T20:26:28.2245384Z 2025-05-07T20:26:28.2245646Z 2025-05-07T20:26:28.2245652Z 2025-05-07T20:26:28.2245657Z 2025-05-07T20:26:28.2246931Z 2025-05-07T20:26:28.2624024Z cuda-nvrtc-12.8.61 | 63.1 MB | ## | 20%  2025-05-07T20:26:28.2624425Z 2025-05-07T20:26:28.2624430Z 2025-05-07T20:26:28.2624435Z 2025-05-07T20:26:28.2624441Z 2025-05-07T20:26:28.2624446Z 2025-05-07T20:26:28.2624776Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%  2025-05-07T20:26:28.2625134Z 2025-05-07T20:26:28.2625140Z 2025-05-07T20:26:28.2625145Z 2025-05-07T20:26:28.2625151Z 2025-05-07T20:26:28.2625155Z 2025-05-07T20:26:28.2917562Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%  2025-05-07T20:26:28.2917941Z 2025-05-07T20:26:28.2917946Z 2025-05-07T20:26:28.2917952Z 2025-05-07T20:26:28.2917957Z 2025-05-07T20:26:28.2917963Z 2025-05-07T20:26:28.2917968Z 2025-05-07T20:26:28.2917973Z 2025-05-07T20:26:28.2917979Z 2025-05-07T20:26:28.2917985Z 2025-05-07T20:26:28.3250161Z libcurand-10.3.9.55 | 43.6 MB | #4 | 14%  2025-05-07T20:26:28.3250559Z 2025-05-07T20:26:28.3250565Z 2025-05-07T20:26:28.3250833Z 2025-05-07T20:26:28.3250838Z 2025-05-07T20:26:28.3250843Z 2025-05-07T20:26:28.3250848Z 2025-05-07T20:26:28.3250853Z 2025-05-07T20:26:28.3255491Z 2025-05-07T20:26:28.3274410Z cuda-nvrtc-12.8.61 | 63.1 MB | ##5 | 25%  2025-05-07T20:26:28.3274802Z 2025-05-07T20:26:28.3274807Z 2025-05-07T20:26:28.3274812Z 2025-05-07T20:26:28.3274817Z 2025-05-07T20:26:28.3274822Z 2025-05-07T20:26:28.3274827Z 2025-05-07T20:26:28.3274832Z 2025-05-07T20:26:28.3274838Z 2025-05-07T20:26:28.3274843Z 2025-05-07T20:26:28.3274848Z 2025-05-07T20:26:28.4024359Z gds-tools-1.13.0.11 | 37.9 MB | | 0%  2025-05-07T20:26:28.4024756Z 2025-05-07T20:26:28.4024761Z 2025-05-07T20:26:28.4024767Z 2025-05-07T20:26:28.4024772Z 2025-05-07T20:26:28.4024777Z 2025-05-07T20:26:28.4024782Z 2025-05-07T20:26:28.4024805Z 2025-05-07T20:26:28.4024811Z 2025-05-07T20:26:28.4030654Z 2025-05-07T20:26:28.4287377Z libcurand-10.3.9.55 | 43.6 MB | ##1 | 21%  2025-05-07T20:26:28.4287793Z 2025-05-07T20:26:28.4287799Z 2025-05-07T20:26:28.4287804Z 2025-05-07T20:26:28.4287809Z 2025-05-07T20:26:28.4287823Z 2025-05-07T20:26:28.4287829Z 2025-05-07T20:26:28.4287833Z 2025-05-07T20:26:28.4287839Z 2025-05-07T20:26:28.4287843Z 2025-05-07T20:26:28.4287848Z 2025-05-07T20:26:28.4432520Z gds-tools-1.13.0.11 | 37.9 MB | 6 | 7%  2025-05-07T20:26:28.4432922Z 2025-05-07T20:26:28.4432928Z 
2025-05-07T20:26:28.4432933Z 2025-05-07T20:26:28.4432938Z 2025-05-07T20:26:28.4432943Z 2025-05-07T20:26:28.4432948Z 2025-05-07T20:26:28.4432953Z 2025-05-07T20:26:28.4437754Z 2025-05-07T20:26:28.5059215Z cuda-nvrtc-12.8.61 | 63.1 MB | ### | 31%  2025-05-07T20:26:28.5059628Z 2025-05-07T20:26:28.5059634Z 2025-05-07T20:26:28.5059660Z 2025-05-07T20:26:28.5059666Z 2025-05-07T20:26:28.5059671Z 2025-05-07T20:26:28.5059676Z 2025-05-07T20:26:28.5059681Z 2025-05-07T20:26:28.5059696Z 2025-05-07T20:26:28.5059701Z 2025-05-07T20:26:28.5302465Z libcurand-10.3.9.55 | 43.6 MB | ##8 | 28%  2025-05-07T20:26:28.5302868Z 2025-05-07T20:26:28.5302873Z 2025-05-07T20:26:28.5302878Z 2025-05-07T20:26:28.5302883Z 2025-05-07T20:26:28.5302888Z 2025-05-07T20:26:28.5302893Z 2025-05-07T20:26:28.5302898Z 2025-05-07T20:26:28.5302904Z 2025-05-07T20:26:28.5302909Z 2025-05-07T20:26:28.5302914Z 2025-05-07T20:26:28.5528946Z gds-tools-1.13.0.11 | 37.9 MB | #3 | 13%  2025-05-07T20:26:28.5529352Z 2025-05-07T20:26:28.5529357Z 2025-05-07T20:26:28.5529363Z 2025-05-07T20:26:28.5529368Z 2025-05-07T20:26:28.5529373Z 2025-05-07T20:26:28.5529386Z 2025-05-07T20:26:28.5529391Z 2025-05-07T20:26:28.5533628Z 2025-05-07T20:26:28.6095169Z cuda-nvrtc-12.8.61 | 63.1 MB | ###5 | 36%  2025-05-07T20:26:28.6095577Z 2025-05-07T20:26:28.6095582Z 2025-05-07T20:26:28.6095587Z 2025-05-07T20:26:28.6095606Z 2025-05-07T20:26:28.6095611Z 2025-05-07T20:26:28.6095616Z 2025-05-07T20:26:28.6095621Z 2025-05-07T20:26:28.6095626Z 2025-05-07T20:26:28.6099327Z 2025-05-07T20:26:28.6303406Z libcurand-10.3.9.55 | 43.6 MB | ###4 | 35%  2025-05-07T20:26:28.6303800Z 2025-05-07T20:26:28.6303805Z 2025-05-07T20:26:28.6303810Z 2025-05-07T20:26:28.6303815Z 2025-05-07T20:26:28.6303820Z 2025-05-07T20:26:28.6303825Z 2025-05-07T20:26:28.6303830Z 2025-05-07T20:26:28.6303835Z 2025-05-07T20:26:28.6303840Z 2025-05-07T20:26:28.6307370Z 2025-05-07T20:26:28.6644395Z gds-tools-1.13.0.11 | 37.9 MB | ## | 20%  2025-05-07T20:26:28.6644789Z 2025-05-07T20:26:28.6644794Z 2025-05-07T20:26:28.6644800Z 2025-05-07T20:26:28.6644805Z 2025-05-07T20:26:28.6644810Z 2025-05-07T20:26:28.6644815Z 2025-05-07T20:26:28.6644834Z 2025-05-07T20:26:28.6648273Z 2025-05-07T20:26:28.7096891Z cuda-nvrtc-12.8.61 | 63.1 MB | #### | 40%  2025-05-07T20:26:28.7097548Z 2025-05-07T20:26:28.7097553Z 2025-05-07T20:26:28.7097559Z 2025-05-07T20:26:28.7097564Z 2025-05-07T20:26:28.7097569Z 2025-05-07T20:26:28.7097573Z 2025-05-07T20:26:28.7097588Z 2025-05-07T20:26:28.7097594Z 2025-05-07T20:26:28.7097599Z 2025-05-07T20:26:28.7304856Z libcurand-10.3.9.55 | 43.6 MB | ####1 | 42%  2025-05-07T20:26:28.7305253Z 2025-05-07T20:26:28.7305259Z 2025-05-07T20:26:28.7305274Z 2025-05-07T20:26:28.7305279Z 2025-05-07T20:26:28.7305284Z 2025-05-07T20:26:28.7305289Z 2025-05-07T20:26:28.7305294Z 2025-05-07T20:26:28.7305299Z 2025-05-07T20:26:28.7305304Z 2025-05-07T20:26:28.7305416Z 2025-05-07T20:26:28.7759429Z gds-tools-1.13.0.11 | 37.9 MB | ##7 | 27%  2025-05-07T20:26:28.7759833Z 2025-05-07T20:26:28.7759839Z 2025-05-07T20:26:28.7759873Z 2025-05-07T20:26:28.7759878Z 2025-05-07T20:26:28.7759883Z 2025-05-07T20:26:28.7759888Z 2025-05-07T20:26:28.7759893Z 2025-05-07T20:26:28.7762744Z 2025-05-07T20:26:28.8115882Z cuda-nvrtc-12.8.61 | 63.1 MB | ####4 | 45%  2025-05-07T20:26:28.8116277Z 2025-05-07T20:26:28.8116282Z 2025-05-07T20:26:28.8116286Z 2025-05-07T20:26:28.8116289Z 2025-05-07T20:26:28.8116293Z 2025-05-07T20:26:28.8116297Z 2025-05-07T20:26:28.8116300Z 2025-05-07T20:26:28.8116304Z 2025-05-07T20:26:28.8116308Z 2025-05-07T20:26:28.8305724Z libcurand-10.3.9.55 
| 43.6 MB | ####8 | 48%  2025-05-07T20:26:28.8306135Z 2025-05-07T20:26:28.8306141Z 2025-05-07T20:26:28.8306146Z 2025-05-07T20:26:28.8306151Z 2025-05-07T20:26:28.8306156Z 2025-05-07T20:26:28.8306162Z 2025-05-07T20:26:28.8306167Z 2025-05-07T20:26:28.8306172Z 2025-05-07T20:26:28.8306177Z 2025-05-07T20:26:28.8306182Z 2025-05-07T20:26:28.8829681Z gds-tools-1.13.0.11 | 37.9 MB | ###4 | 35%  2025-05-07T20:26:28.8830072Z 2025-05-07T20:26:28.8830078Z 2025-05-07T20:26:28.8830083Z 2025-05-07T20:26:28.8830101Z 2025-05-07T20:26:28.8830106Z 2025-05-07T20:26:28.8830111Z 2025-05-07T20:26:28.8830116Z 2025-05-07T20:26:28.8830898Z 2025-05-07T20:26:28.9222451Z cuda-nvrtc-12.8.61 | 63.1 MB | ####9 | 49%  2025-05-07T20:26:28.9222839Z 2025-05-07T20:26:28.9222845Z 2025-05-07T20:26:28.9222850Z 2025-05-07T20:26:28.9222855Z 2025-05-07T20:26:28.9222860Z 2025-05-07T20:26:28.9222865Z 2025-05-07T20:26:28.9222883Z 2025-05-07T20:26:28.9222888Z 2025-05-07T20:26:28.9224517Z 2025-05-07T20:26:28.9353330Z libcurand-10.3.9.55 | 43.6 MB | #####4 | 55%  2025-05-07T20:26:28.9353720Z 2025-05-07T20:26:28.9353734Z 2025-05-07T20:26:28.9353739Z 2025-05-07T20:26:28.9353745Z 2025-05-07T20:26:28.9353750Z 2025-05-07T20:26:28.9353755Z 2025-05-07T20:26:28.9353760Z 2025-05-07T20:26:28.9353765Z 2025-05-07T20:26:28.9354028Z 2025-05-07T20:26:28.9354035Z 2025-05-07T20:26:28.9859234Z gds-tools-1.13.0.11 | 37.9 MB | ####1 | 42%  2025-05-07T20:26:28.9859644Z 2025-05-07T20:26:28.9859649Z 2025-05-07T20:26:28.9859654Z 2025-05-07T20:26:28.9859660Z 2025-05-07T20:26:28.9859665Z 2025-05-07T20:26:28.9859670Z 2025-05-07T20:26:28.9859675Z 2025-05-07T20:26:28.9861139Z 2025-05-07T20:26:29.0261196Z cuda-nvrtc-12.8.61 | 63.1 MB | #####3 | 54%  2025-05-07T20:26:29.0261593Z 2025-05-07T20:26:29.0261598Z 2025-05-07T20:26:29.0261604Z 2025-05-07T20:26:29.0261609Z 2025-05-07T20:26:29.0261614Z 2025-05-07T20:26:29.0261620Z 2025-05-07T20:26:29.0261625Z 2025-05-07T20:26:29.0261630Z 2025-05-07T20:26:29.0261635Z 2025-05-07T20:26:29.0355001Z libcurand-10.3.9.55 | 43.6 MB | ######1 | 61%  2025-05-07T20:26:29.0355418Z 2025-05-07T20:26:29.0355423Z 2025-05-07T20:26:29.0355429Z 2025-05-07T20:26:29.0355434Z 2025-05-07T20:26:29.0355450Z 2025-05-07T20:26:29.0355455Z 2025-05-07T20:26:29.0355460Z 2025-05-07T20:26:29.0355465Z 2025-05-07T20:26:29.0355559Z 2025-05-07T20:26:29.0357523Z 2025-05-07T20:26:29.0875366Z gds-tools-1.13.0.11 | 37.9 MB | ####9 | 49%  2025-05-07T20:26:29.0875880Z 2025-05-07T20:26:29.0875886Z 2025-05-07T20:26:29.0875891Z 2025-05-07T20:26:29.0875896Z 2025-05-07T20:26:29.0875901Z 2025-05-07T20:26:29.0875906Z 2025-05-07T20:26:29.0875911Z 2025-05-07T20:26:29.0875916Z 2025-05-07T20:26:29.1269447Z cuda-nvrtc-12.8.61 | 63.1 MB | #####8 | 58%  2025-05-07T20:26:29.1269838Z 2025-05-07T20:26:29.1269844Z 2025-05-07T20:26:29.1269849Z 2025-05-07T20:26:29.1269854Z 2025-05-07T20:26:29.1269869Z 2025-05-07T20:26:29.1269874Z 2025-05-07T20:26:29.1269879Z 2025-05-07T20:26:29.1269884Z 2025-05-07T20:26:29.1269890Z 2025-05-07T20:26:29.1449537Z libcurand-10.3.9.55 | 43.6 MB | ######7 | 68%  2025-05-07T20:26:29.1449964Z 2025-05-07T20:26:29.1449970Z 2025-05-07T20:26:29.1449975Z 2025-05-07T20:26:29.1449980Z 2025-05-07T20:26:29.1449985Z 2025-05-07T20:26:29.1450003Z 2025-05-07T20:26:29.1450008Z 2025-05-07T20:26:29.1450013Z 2025-05-07T20:26:29.1450018Z 2025-05-07T20:26:29.1451595Z 2025-05-07T20:26:29.1880230Z gds-tools-1.13.0.11 | 37.9 MB | #####6 | 57%  2025-05-07T20:26:29.1880573Z 2025-05-07T20:26:29.1880577Z 2025-05-07T20:26:29.1880581Z 2025-05-07T20:26:29.1880585Z 2025-05-07T20:26:29.1880588Z 
2025-05-07T20:26:29.1880599Z 2025-05-07T20:26:29.1880603Z 2025-05-07T20:26:29.1880606Z 2025-05-07T20:26:29.2307799Z cuda-nvrtc-12.8.61 | 63.1 MB | ######2 | 63%  2025-05-07T20:26:29.2308102Z 2025-05-07T20:26:29.2308114Z 2025-05-07T20:26:29.2308118Z 2025-05-07T20:26:29.2308121Z 2025-05-07T20:26:29.2308125Z 2025-05-07T20:26:29.2308129Z 2025-05-07T20:26:29.2308132Z 2025-05-07T20:26:29.2308136Z 2025-05-07T20:26:29.2308140Z 2025-05-07T20:26:29.2466793Z libcurand-10.3.9.55 | 43.6 MB | #######4 | 74%  2025-05-07T20:26:29.2467171Z 2025-05-07T20:26:29.2467195Z 2025-05-07T20:26:29.2467200Z 2025-05-07T20:26:29.2467205Z 2025-05-07T20:26:29.2467210Z 2025-05-07T20:26:29.2467215Z 2025-05-07T20:26:29.2467221Z 2025-05-07T20:26:29.2467225Z 2025-05-07T20:26:29.2467230Z 2025-05-07T20:26:29.2467236Z 2025-05-07T20:26:29.2923921Z gds-tools-1.13.0.11 | 37.9 MB | ######3 | 64%  2025-05-07T20:26:29.2924228Z 2025-05-07T20:26:29.2924232Z 2025-05-07T20:26:29.2924235Z 2025-05-07T20:26:29.2924239Z 2025-05-07T20:26:29.2924243Z 2025-05-07T20:26:29.2924247Z 2025-05-07T20:26:29.2924250Z 2025-05-07T20:26:29.2929995Z 2025-05-07T20:26:29.3308566Z cuda-nvrtc-12.8.61 | 63.1 MB | ######7 | 67%  2025-05-07T20:26:29.3308951Z 2025-05-07T20:26:29.3308957Z 2025-05-07T20:26:29.3308962Z 2025-05-07T20:26:29.3308966Z 2025-05-07T20:26:29.3308971Z 2025-05-07T20:26:29.3309242Z 2025-05-07T20:26:29.3309249Z 2025-05-07T20:26:29.3309254Z 2025-05-07T20:26:29.3309259Z 2025-05-07T20:26:29.3469749Z libcurand-10.3.9.55 | 43.6 MB | ########1 | 81%  2025-05-07T20:26:29.3470167Z 2025-05-07T20:26:29.3470172Z 2025-05-07T20:26:29.3470177Z 2025-05-07T20:26:29.3470181Z 2025-05-07T20:26:29.3470186Z 2025-05-07T20:26:29.3470198Z 2025-05-07T20:26:29.3470203Z 2025-05-07T20:26:29.3470207Z 2025-05-07T20:26:29.3470212Z 2025-05-07T20:26:29.3471344Z 2025-05-07T20:26:29.4038521Z gds-tools-1.13.0.11 | 37.9 MB | #######1 | 71%  2025-05-07T20:26:29.4038865Z 2025-05-07T20:26:29.4038869Z 2025-05-07T20:26:29.4038873Z 2025-05-07T20:26:29.4038877Z 2025-05-07T20:26:29.4038880Z 2025-05-07T20:26:29.4038884Z 2025-05-07T20:26:29.4038888Z 2025-05-07T20:26:29.4040202Z 2025-05-07T20:26:29.4356908Z cuda-nvrtc-12.8.61 | 63.1 MB | #######1 | 72%  2025-05-07T20:26:29.4357247Z 2025-05-07T20:26:29.4357281Z 2025-05-07T20:26:29.4357285Z 2025-05-07T20:26:29.4357289Z 2025-05-07T20:26:29.4357293Z 2025-05-07T20:26:29.4357296Z 2025-05-07T20:26:29.4357534Z 2025-05-07T20:26:29.4357537Z 2025-05-07T20:26:29.4360022Z 2025-05-07T20:26:29.4498482Z libcurand-10.3.9.55 | 43.6 MB | ########7 | 88%  2025-05-07T20:26:29.4498858Z 2025-05-07T20:26:29.4498862Z 2025-05-07T20:26:29.4498866Z 2025-05-07T20:26:29.4498869Z 2025-05-07T20:26:29.4498873Z 2025-05-07T20:26:29.4498877Z 2025-05-07T20:26:29.4498880Z 2025-05-07T20:26:29.4498884Z 2025-05-07T20:26:29.4498887Z 2025-05-07T20:26:29.4498891Z 2025-05-07T20:26:29.5038793Z gds-tools-1.13.0.11 | 37.9 MB | #######8 | 78%  2025-05-07T20:26:29.5039109Z 2025-05-07T20:26:29.5039113Z 2025-05-07T20:26:29.5039116Z 2025-05-07T20:26:29.5039120Z 2025-05-07T20:26:29.5039124Z 2025-05-07T20:26:29.5039127Z 2025-05-07T20:26:29.5039131Z 2025-05-07T20:26:29.5040588Z 2025-05-07T20:26:29.5452861Z cuda-nvrtc-12.8.61 | 63.1 MB | #######6 | 76%  2025-05-07T20:26:29.5453192Z 2025-05-07T20:26:29.5453196Z 2025-05-07T20:26:29.5453211Z 2025-05-07T20:26:29.5453215Z 2025-05-07T20:26:29.5453218Z 2025-05-07T20:26:29.5453222Z 2025-05-07T20:26:29.5453225Z 2025-05-07T20:26:29.5453229Z 2025-05-07T20:26:29.5458167Z 2025-05-07T20:26:29.5512953Z libcurand-10.3.9.55 | 43.6 MB | #########3 | 94%  
2025-05-07T20:26:29.5513332Z 2025-05-07T20:26:29.5513338Z 2025-05-07T20:26:29.5513353Z 2025-05-07T20:26:29.5513359Z 2025-05-07T20:26:29.5513363Z 2025-05-07T20:26:29.5513366Z 2025-05-07T20:26:29.5513370Z 2025-05-07T20:26:29.5513373Z 2025-05-07T20:26:29.5513377Z 2025-05-07T20:26:29.5513380Z 2025-05-07T20:26:29.6040830Z gds-tools-1.13.0.11 | 37.9 MB | ########5 | 85%  2025-05-07T20:26:29.6041197Z 2025-05-07T20:26:29.6041201Z 2025-05-07T20:26:29.6041205Z 2025-05-07T20:26:29.6041208Z 2025-05-07T20:26:29.6041212Z 2025-05-07T20:26:29.6041228Z 2025-05-07T20:26:29.6041232Z 2025-05-07T20:26:29.6043476Z 2025-05-07T20:26:29.7043016Z cuda-nvrtc-12.8.61 | 63.1 MB | ######## | 81%  2025-05-07T20:26:29.7043332Z 2025-05-07T20:26:29.7043335Z 2025-05-07T20:26:29.7043339Z 2025-05-07T20:26:29.7043342Z 2025-05-07T20:26:29.7043346Z 2025-05-07T20:26:29.7043349Z 2025-05-07T20:26:29.7043353Z 2025-05-07T20:26:29.7056376Z 2025-05-07T20:26:29.7797480Z cuda-nvrtc-12.8.61 | 63.1 MB | ########6 | 86%  2025-05-07T20:26:29.7797771Z 2025-05-07T20:26:29.7797775Z 2025-05-07T20:26:29.7797779Z 2025-05-07T20:26:29.7797782Z 2025-05-07T20:26:29.7797786Z 2025-05-07T20:26:29.7797789Z 2025-05-07T20:26:29.7797793Z 2025-05-07T20:26:29.7797797Z 2025-05-07T20:26:29.7797800Z 2025-05-07T20:26:29.7797804Z 2025-05-07T20:26:29.8045313Z gds-tools-1.13.0.11 | 37.9 MB | #########2 | 92%  2025-05-07T20:26:29.8045671Z 2025-05-07T20:26:29.8045675Z 2025-05-07T20:26:29.8045888Z 2025-05-07T20:26:29.8045894Z 2025-05-07T20:26:29.8045897Z 2025-05-07T20:26:29.8045910Z 2025-05-07T20:26:29.8045913Z 2025-05-07T20:26:29.8047392Z 2025-05-07T20:26:29.8800278Z cuda-nvrtc-12.8.61 | 63.1 MB | #########1 | 91%  2025-05-07T20:26:29.8800683Z 2025-05-07T20:26:29.8800690Z 2025-05-07T20:26:29.8800694Z 2025-05-07T20:26:29.8800700Z 2025-05-07T20:26:29.8800705Z 2025-05-07T20:26:29.8800710Z 2025-05-07T20:26:29.8800715Z 2025-05-07T20:26:29.8800720Z 2025-05-07T20:26:29.8800725Z 2025-05-07T20:26:29.8800730Z 2025-05-07T20:26:29.9062912Z gds-tools-1.13.0.11 | 37.9 MB | #########9 | 100%  2025-05-07T20:26:29.9063359Z 2025-05-07T20:26:29.9063365Z 2025-05-07T20:26:29.9063370Z 2025-05-07T20:26:29.9063375Z 2025-05-07T20:26:29.9063380Z 2025-05-07T20:26:29.9063385Z 2025-05-07T20:26:29.9063390Z 2025-05-07T20:26:29.9063395Z 2025-05-07T20:26:31.1497596Z cuda-nvrtc-12.8.61 | 63.1 MB | #########6 | 96%  2025-05-07T20:26:31.1498042Z 2025-05-07T20:26:31.1498049Z 2025-05-07T20:26:31.1498055Z 2025-05-07T20:26:31.1498068Z 2025-05-07T20:26:31.1498338Z 2025-05-07T20:26:31.1498341Z 2025-05-07T20:26:31.1498345Z 2025-05-07T20:26:31.1498349Z 2025-05-07T20:26:31.1498353Z 2025-05-07T20:26:31.1498356Z 2025-05-07T20:26:31.1907427Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%  2025-05-07T20:26:31.1907839Z 2025-05-07T20:26:31.1907844Z 2025-05-07T20:26:31.1907849Z 2025-05-07T20:26:31.1907854Z 2025-05-07T20:26:31.1907859Z 2025-05-07T20:26:31.1907864Z 2025-05-07T20:26:31.1907870Z 2025-05-07T20:26:31.1907875Z 2025-05-07T20:26:31.1909149Z 2025-05-07T20:26:31.1912982Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%  2025-05-07T20:26:31.1913373Z 2025-05-07T20:26:31.1913380Z 2025-05-07T20:26:31.1913385Z 2025-05-07T20:26:31.1913390Z 2025-05-07T20:26:31.1913395Z 2025-05-07T20:26:31.1913400Z 2025-05-07T20:26:31.1913406Z 2025-05-07T20:26:31.1913432Z 2025-05-07T20:26:31.1913438Z 2025-05-07T20:26:31.1913443Z 2025-05-07T20:26:31.1922474Z 2025-05-07T20:26:31.2472540Z libnvjitlink-12.8.61 | 28.7 MB | | 0%  2025-05-07T20:26:31.2472973Z 2025-05-07T20:26:31.2472978Z 2025-05-07T20:26:31.2472984Z 
2025-05-07T20:26:31.2472989Z 2025-05-07T20:26:31.2472994Z 2025-05-07T20:26:31.2472999Z 2025-05-07T20:26:31.2473005Z 2025-05-07T20:26:31.2473010Z 2025-05-07T20:26:31.2473026Z 2025-05-07T20:26:31.2473032Z 2025-05-07T20:26:31.2473038Z 2025-05-07T20:26:31.2473047Z 2025-05-07T20:26:31.2908893Z cuda-nvcc-tools-12.8 | 24.5 MB | | 0%  2025-05-07T20:26:31.2909226Z 2025-05-07T20:26:31.2909230Z 2025-05-07T20:26:31.2909234Z 2025-05-07T20:26:31.2909238Z 2025-05-07T20:26:31.2909241Z 2025-05-07T20:26:31.2909245Z 2025-05-07T20:26:31.2909249Z 2025-05-07T20:26:31.2909252Z 2025-05-07T20:26:31.2909256Z 2025-05-07T20:26:31.2909260Z 2025-05-07T20:26:31.2911504Z 2025-05-07T20:26:31.3478863Z libnvjitlink-12.8.61 | 28.7 MB | # | 10%  2025-05-07T20:26:31.3479199Z 2025-05-07T20:26:31.3479203Z 2025-05-07T20:26:31.3479207Z 2025-05-07T20:26:31.3479211Z 2025-05-07T20:26:31.3479214Z 2025-05-07T20:26:31.3479218Z 2025-05-07T20:26:31.3479222Z 2025-05-07T20:26:31.3479225Z 2025-05-07T20:26:31.3479229Z 2025-05-07T20:26:31.3479233Z 2025-05-07T20:26:31.3479236Z 2025-05-07T20:26:31.3479240Z 2025-05-07T20:26:31.3917043Z cuda-nvcc-tools-12.8 | 24.5 MB | # | 11%  2025-05-07T20:26:31.3917358Z 2025-05-07T20:26:31.3917363Z 2025-05-07T20:26:31.3917366Z 2025-05-07T20:26:31.3917370Z 2025-05-07T20:26:31.3917374Z 2025-05-07T20:26:31.3917378Z 2025-05-07T20:26:31.3917381Z 2025-05-07T20:26:31.3917392Z 2025-05-07T20:26:31.3917396Z 2025-05-07T20:26:31.3917399Z 2025-05-07T20:26:31.3917431Z 2025-05-07T20:26:31.4137712Z libnvjitlink-12.8.61 | 28.7 MB | ## | 21%  2025-05-07T20:26:31.4138143Z 2025-05-07T20:26:31.4138150Z 2025-05-07T20:26:31.4138895Z 2025-05-07T20:26:31.4490000Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%  2025-05-07T20:26:31.4490336Z 2025-05-07T20:26:31.4490341Z 2025-05-07T20:26:31.4490345Z 2025-05-07T20:26:31.4490348Z 2025-05-07T20:26:31.4490352Z 2025-05-07T20:26:31.4490356Z 2025-05-07T20:26:31.4490359Z 2025-05-07T20:26:31.4490363Z 2025-05-07T20:26:31.4490366Z 2025-05-07T20:26:31.4490370Z 2025-05-07T20:26:31.4490374Z 2025-05-07T20:26:31.4493133Z 2025-05-07T20:26:31.4921590Z cuda-nvcc-tools-12.8 | 24.5 MB | ##1 | 21%  2025-05-07T20:26:31.4922093Z 2025-05-07T20:26:31.4922097Z 2025-05-07T20:26:31.4922101Z 2025-05-07T20:26:31.4922105Z 2025-05-07T20:26:31.4922108Z 2025-05-07T20:26:31.4922112Z 2025-05-07T20:26:31.4922115Z 2025-05-07T20:26:31.4922119Z 2025-05-07T20:26:31.4922123Z 2025-05-07T20:26:31.4922126Z 2025-05-07T20:26:31.4922130Z 2025-05-07T20:26:31.5559155Z libnvjitlink-12.8.61 | 28.7 MB | ###1 | 32%  2025-05-07T20:26:31.5559553Z 2025-05-07T20:26:31.5559805Z 2025-05-07T20:26:31.5559809Z 2025-05-07T20:26:31.5559813Z 2025-05-07T20:26:31.5559816Z 2025-05-07T20:26:31.5559820Z 2025-05-07T20:26:31.5559823Z 2025-05-07T20:26:31.5559827Z 2025-05-07T20:26:31.5559831Z 2025-05-07T20:26:31.5559844Z 2025-05-07T20:26:31.5559847Z 2025-05-07T20:26:31.5563009Z 2025-05-07T20:26:31.5924267Z cuda-nvcc-tools-12.8 | 24.5 MB | ###2 | 32%  2025-05-07T20:26:31.5924619Z 2025-05-07T20:26:31.5924623Z 2025-05-07T20:26:31.5924627Z 2025-05-07T20:26:31.5924630Z 2025-05-07T20:26:31.5924634Z 2025-05-07T20:26:31.5924638Z 2025-05-07T20:26:31.5924641Z 2025-05-07T20:26:31.5924645Z 2025-05-07T20:26:31.5924649Z 2025-05-07T20:26:31.5924652Z 2025-05-07T20:26:31.5927694Z 2025-05-07T20:26:31.6568037Z libnvjitlink-12.8.61 | 28.7 MB | ####2 | 42%  2025-05-07T20:26:31.6568470Z 2025-05-07T20:26:31.6568474Z 2025-05-07T20:26:31.6568478Z 2025-05-07T20:26:31.6568481Z 2025-05-07T20:26:31.6568499Z 2025-05-07T20:26:31.6568503Z 2025-05-07T20:26:31.6568506Z 
2025-05-07T20:26:31.8161325Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%
2025-05-07T20:26:32.0243391Z nsight-compute-2025. | 320.6 MB | ########## | 100%
2025-05-07T20:26:32.2131724Z cuda-nvrtc-12.8.61 | 63.1 MB | ########## | 100%
2025-05-07T20:26:33.1287830Z cuda-nvcc-tools-12.8 | 24.5 MB | ########## | 100%
2025-05-07T20:26:33.3789384Z libnvjitlink-12.8.61 | 28.7 MB | ########## | 100%
2025-05-07T20:26:33.6935735Z cuda-nvvm-impl-12.8. | 20.8 MB | ########## | 100%
2025-05-07T20:26:33.7936329Z cuda-nvvm-tools-12.8 | 23.5 MB | ########## | 100%
2025-05-07T20:26:33.9624654Z cuda-nvcc-dev_linux- | 12.7 MB | ########## | 100%
2025-05-07T20:26:33.9969104Z cuda-sanitizer-api-1 | 8.8 MB | ########## | 100%
2025-05-07T20:26:34.0010588Z cuda-nvdisasm-12.8.5 | 4.9 MB | ########## | 100%
2025-05-07T20:26:34.1012604Z cuda-cupti-dev-12.8. | 4.0 MB | ########## | 100%
2025-05-07T20:26:34.2596800Z ... (more hidden) ...
2025-05-07T20:26:35.0954590Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%
2025-05-07T20:26:35.6477980Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%
2025-05-07T20:26:36.2290263Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%
2025-05-07T20:26:36.2614606Z libcublas-12.8.3.14 | 460.2 MB | ########## | 100%
2025-05-07T20:26:36.6271055Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%
2025-05-07T20:26:36.9864936Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%
2025-05-07T20:26:44.0315843Z Preparing transaction: done
2025-05-07T20:26:48.0465607Z Verifying transaction: done
2025-05-07T20:26:48.6547313Z Executing transaction: done
2025-05-07T20:26:50.8241903Z [INSTALL] Fixing file placements for CUDA 12.8.0+ ...
2025-05-07T20:26:50.8242340Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:50.8243034Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:50.8257176Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
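[NOTE] The two `ln -sf` calls above recreate the unversioned libnvToolsExt.so name that the build links against but that, judging from this log, the CUDA 12.8.0+ conda packages no longer ship (only the versioned .so.1 is present). A minimal sketch of the same fix; the PREFIX variable and the loop are illustrative, not part of the actual setup scripts:

  PREFIX=/home/ec2-user/miniconda/envs/build_binary
  # Recreate the unversioned .so in both library directories, if the versioned one exists
  for libdir in "$PREFIX/lib" "$PREFIX/targets/x86_64-linux/lib"; do
    if [ -f "$libdir/libnvToolsExt.so.1" ]; then
      ln -sf "$libdir/libnvToolsExt.so.1" "$libdir/libnvToolsExt.so"
    fi
  done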
2025-05-07T20:26:50.8270897Z [INSTALL] Copying nvtx3 headers ...
2025-05-07T20:26:50.8276127Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:50.9922639Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:50.9946105Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:51.0325466Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:52.9157268Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:52.9794818Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:53.4036027Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:53.4379106Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
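[NOTE] The ERROR from `conda run printenv LD_LIBRARY_PATH` above is expected at this point: printenv exits non-zero when the variable is not yet set in the env, and the very next step sets it. `conda env config vars set` stores the variable in the environment itself, so it is exported on every later activation or `conda run`. A minimal sketch of that pattern, reusing the env name and value from this log; the list/printenv calls are just two ways to verify:

  # Persist the stubs directory in the named env, then confirm it took effect
  conda env config vars set -n build_binary \
      LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
  conda env config vars list -n build_binary
  conda run -n build_binary printenv LD_LIBRARY_PATH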
2025-05-07T20:26:53.8761263Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/" 2025-05-07T20:26:53.8761988Z 2025-05-07T20:26:54.2998481Z 2025-05-07T20:26:56.3327275Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h 2025-05-07T20:26:58.3806424Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so 2025-05-07T20:27:00.4067954Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so 2025-05-07T20:27:00.4068750Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so 2025-05-07T20:27:02.4396553Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so 2025-05-07T20:27:04.3414899Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc 2025-05-07T20:27:04.3415281Z 2025-05-07T20:27:04.4029773Z [CHECK] Binary nvcc found in PATH 2025-05-07T20:27:08.2437528Z /tmp/tmpc19u08lt: line 3: clang: command not found 2025-05-07T20:27:08.2437819Z 2025-05-07T20:27:08.2438218Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error) 2025-05-07T20:27:08.3069615Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d 2025-05-07T20:27:08.3069930Z 2025-05-07T20:27:08.3090262Z total 36 2025-05-07T20:27:08.3090563Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:26 . 2025-05-07T20:27:08.3090942Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:25 .. 2025-05-07T20:27:08.3091394Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh 2025-05-07T20:27:08.3091908Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh 2025-05-07T20:27:08.3092389Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh 2025-05-07T20:27:08.3092861Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh 2025-05-07T20:27:08.3093607Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh 2025-05-07T20:27:08.3094074Z -rw-r--r--. 2 ec2-user ec2-user 2932 Jan 24 22:22 ~cuda-nvcc_activate.sh 2025-05-07T20:27:08.3094365Z 2025-05-07T20:27:08.3094585Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ... 2025-05-07T20:27:08.3095229Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh 2025-05-07T20:27:08.3095645Z 2025-05-07T20:27:08.3117142Z 2025-05-07T20:27:08.3117549Z + conda run -n build_binary c++ --version | grep -i clang 2025-05-07T20:27:08.3117812Z 2025-05-07T20:27:10.2823624Z 2025-05-07T20:27:10.2824635Z [BUILD] Setting prepend flags for NVCC ... 2025-05-07T20:27:10.2825685Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler" 2025-05-07T20:27:10.2826435Z 2025-05-07T20:27:10.7097656Z 2025-05-07T20:27:10.7098355Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS 2025-05-07T20:27:10.7098837Z 2025-05-07T20:27:12.6032668Z -allow-unsupported-compiler 2025-05-07T20:27:12.6032974Z 2025-05-07T20:27:12.6660547Z 2025-05-07T20:27:12.6660829Z [INFO] Printing out all preprocessor defines in nvcc ... 
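[NOTE] The last two steps work around nvcc's host-compiler checks: the sed call strips the `-ccbin=` pin that the cuda-nvcc activation script would otherwise inject, and NVCC_PREPEND_FLAGS=-allow-unsupported-compiler stops nvcc from rejecting a host gcc newer than it officially supports. A minimal sketch of the same workaround in an activated shell; the export is illustrative, since this job persists the flag with `conda env config vars set` instead:

  # Drop the pinned host compiler from the nvcc activation hook ...
  sed -i '/-ccbin=/d' "$CONDA_PREFIX"/etc/conda/activate.d/*cuda-nvcc_activate.sh
  # ... and let nvcc accept a host compiler it does not officially support
  export NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"

The preprocessor dump that follows can also be filtered rather than printed whole, e.g. by piping the same `nvcc --compiler-options -dM -E -x cu -` invocation through `grep __CUDACC` to check just the toolkit version macros.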
2025-05-07T20:27:12.6660829Z [INFO] Printing out all preprocessor defines in nvcc ...
2025-05-07T20:27:12.6661655Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null
2025-05-07T20:27:14.6326948Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead")))
2025-05-07T20:27:14.6327566Z #define M_PIl 3.141592653589793238462643383279502884L
2025-05-07T20:27:14.6327907Z #define _IO_CURRENTLY_PUTTING 0x800
2025-05-07T20:27:14.6328231Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig))
2025-05-07T20:27:14.6328561Z #define __DBL_MIN_EXP__ (-1021)
2025-05-07T20:27:14.6328824Z #define _STL_PAIR_H 1
2025-05-07T20:27:14.6329078Z #define __cpp_attributes 200809L
2025-05-07T20:27:14.6329402Z #define __cpp_nontype_template_parameter_auto 201606L
2025-05-07T20:27:14.6329742Z #define __DELETE_THROW throw()
2025-05-07T20:27:14.6330005Z #define _PTRDIFF_T_
2025-05-07T20:27:14.6330271Z #define M_PI_4 0.78539816339744830962
2025-05-07T20:27:14.6330556Z #define __UINT_LEAST16_MAX__ 0xffff
2025-05-07T20:27:14.6330829Z #define _IO_LEFT 02
2025-05-07T20:27:14.6331077Z #define __ATOMIC_ACQUIRE 2
2025-05-07T20:27:14.6331330Z #define _POSIX2_BC_SCALE_MAX 99
2025-05-07T20:27:14.6331607Z #define _GLIBCXX_USE_RANDOM_TR1 1
2025-05-07T20:27:14.6332065Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp)
2025-05-07T20:27:14.6332497Z #define __FLT128_MAX_10_EXP__ 4932
2025-05-07T20:27:14.6332840Z #define RE_DUP_MAX (0x7fff)
2025-05-07T20:27:14.6333225Z #define _IOS_OUTPUT 2
2025-05-07T20:27:14.6333566Z #define __SM_100_RT_HPP__
2025-05-07T20:27:14.6334026Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F
2025-05-07T20:27:14.6334529Z #define toascii_l(c,l) __toascii_l ((c), (l))
2025-05-07T20:27:14.6334955Z #define __GCC_IEC_559_COMPLEX 2
2025-05-07T20:27:14.6335317Z #define _GLIBCXX_USE_FCHMOD 1
2025-05-07T20:27:14.6335701Z #define __cpp_aggregate_nsdmi 201304L
2025-05-07T20:27:14.6336777Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; }))
2025-05-07T20:27:14.6337864Z #define __UINT_LEAST8_TYPE__ unsigned char
2025-05-07T20:27:14.6338277Z #define __SIZEOF_FLOAT80__ 16
2025-05-07T20:27:14.6338677Z #define cudaTextureTypeCubemapLayered 0xFC
2025-05-07T20:27:14.6339089Z #define _T_WCHAR_
2025-05-07T20:27:14.6339313Z #define stdout stdout
2025-05-07T20:27:14.6339645Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11")))
2025-05-07T20:27:14.6340025Z #define CHAR_BIT __CHAR_BIT__
2025-05-07T20:27:14.6340276Z #define __flexarr []
2025-05-07T20:27:14.6340515Z #define _GLIBCXX_HAVE_FINITEF 1
2025-05-07T20:27:14.6340837Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l))
2025-05-07T20:27:14.6341177Z #define _IO_FLAGS2_USER_WBUF 8
2025-05-07T20:27:14.6341434Z #define _MATH_H 1
2025-05-07T20:27:14.6342010Z #define cudaOccupancyDisableCachingOverride 0x01
2025-05-07T20:27:14.6342356Z #define __S64_TYPE long int
2025-05-07T20:27:14.6342609Z #define __stub_fchflags
2025-05-07T20:27:14.6342879Z #define cudaDeviceScheduleMask 0x07
2025-05-07T20:27:14.6343174Z #define __SQUAD_TYPE long int
2025-05-07T20:27:14.6343435Z #define __INTMAX_C(c) c ## L
2025-05-07T20:27:14.6343741Z #define cudaStreamFireAndForget ((cudaStream_t)0x4)
2025-05-07T20:27:14.6344132Z #define _BSD_SIZE_T_DEFINED_
2025-05-07T20:27:14.6344386Z #define NL_NMAX INT_MAX
2025-05-07T20:27:14.6344622Z #define _BITS_TIME_H 1
2025-05-07T20:27:14.6344899Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:27:14.6345223Z #define _GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:27:14.6345528Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:27:14.6345880Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:27:14.6346279Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:27:14.6346645Z #define __CHAR_BIT__ 8 2025-05-07T20:27:14.6346909Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:14.6347383Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:27:14.6347673Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:27:14.6347942Z #define FP_NAN 0 2025-05-07T20:27:14.6348205Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:27:14.6348611Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:27:14.6348999Z #define __cudaCDP2GetErrorString 2025-05-07T20:27:14.6349287Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:27:14.6349547Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:27:14.6349809Z #define __SM_80_RT_H__ 2025-05-07T20:27:14.6350040Z #define _NEW 2025-05-07T20:27:14.6350264Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:27:14.6350545Z #define __UINT8_MAX__ 0xff 2025-05-07T20:27:14.6350914Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:27:14.6351323Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:27:14.6351570Z #define __USE_ANSI 1 2025-05-07T20:27:14.6351858Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:27:14.6352258Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:27:14.6352612Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:27:14.6352915Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:27:14.6353200Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:27:14.6353479Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:27:14.6353761Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:27:14.6354049Z #define PIPE_BUF 4096 2025-05-07T20:27:14.6354363Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:27:14.6354816Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:27:14.6355195Z #define ADJ_TICK 0x4000 2025-05-07T20:27:14.6355603Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:27:14.6355923Z #define MQ_PRIO_MAX 32768 2025-05-07T20:27:14.6356199Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:27:14.6356525Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:27:14.6356996Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:27:14.6357523Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:27:14.6357896Z #define _XOPEN_SOURCE 700 2025-05-07T20:27:14.6358156Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:27:14.6358430Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:27:14.6358720Z #define __cpp_static_assert 201411L 2025-05-07T20:27:14.6359008Z #define __GLIBCXX__ 20230528 2025-05-07T20:27:14.6359275Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:27:14.6359565Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:27:14.6359846Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:27:14.6360147Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:27:14.6360431Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:27:14.6360738Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:14.6361181Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:27:14.6361525Z #define 
__WCHAR_MAX__ 0x7fffffff 2025-05-07T20:27:14.6361814Z #define _GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:27:14.6362136Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:14.6362491Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:27:14.6362860Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:27:14.6363158Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:27:14.6363449Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:27:14.6363801Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:27:14.6364164Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:27:14.6364570Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:27:14.6364979Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:27:14.6365295Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:27:14.6366556Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:27:14.6366870Z #define __GCC_IEC_559 2 2025-05-07T20:27:14.6367190Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:27:14.6367571Z #define _IO_flockfile(_fp) 2025-05-07T20:27:14.6367979Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:27:14.6368249Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:27:14.6368514Z #define _IOFBF 0 2025-05-07T20:27:14.6368726Z #define __USE_BSD 1 2025-05-07T20:27:14.6368955Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:27:14.6369230Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:27:14.6369498Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:27:14.6369755Z #define _IO_NO_WRITES 8 2025-05-07T20:27:14.6370017Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:27:14.6370377Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:27:14.6370726Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:27:14.6371036Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:27:14.6371363Z #define __cpp_binary_literals 201304L 2025-05-07T20:27:14.6371659Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:27:14.6371932Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:27:14.6372206Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:27:14.6372521Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:27:14.6372907Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:27:14.6373277Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:27:14.6373581Z #define M_PI 3.14159265358979323846 2025-05-07T20:27:14.6373892Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:27:14.6374229Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:27:14.6374539Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:27:14.6374840Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:27:14.6375118Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:27:14.6375388Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:27:14.6375974Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:27:14.6376565Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:27:14.6376889Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:27:14.6377222Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:27:14.6377526Z #define __cudaCDP2GetErrorName 2025-05-07T20:27:14.6377802Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:27:14.6378065Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:27:14.6378365Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:27:14.6378696Z #define __cpp_variadic_templates 200704L 2025-05-07T20:27:14.6378995Z #define RAND_MAX 2147483647 2025-05-07T20:27:14.6379257Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:27:14.6379584Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:14.6379902Z #define __SM_90_RT_H__ 2025-05-07T20:27:14.6380143Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:27:14.6380405Z #define __COMPAR_FN_T 2025-05-07T20:27:14.6380654Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:27:14.6381040Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:27:14.6381520Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:27:14.6382038Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:27:14.6382376Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:27:14.6382738Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:27:14.6383040Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:27:14.6383378Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:27:14.6383685Z #define __cpp_variable_templates 201304L 2025-05-07T20:27:14.6384191Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:27:14.6384733Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:27:14.6385058Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:27:14.6385339Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:27:14.6385642Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:27:14.6385939Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:27:14.6386216Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:27:14.6386488Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:27:14.6386752Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:27:14.6387082Z #define __u_char_defined 2025-05-07T20:27:14.6387402Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:27:14.6387764Z #define STA_PPSERROR 0x0800 2025-05-07T20:27:14.6388019Z #define _GLIBCXX_STD_A std 2025-05-07T20:27:14.6388275Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:27:14.6388559Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:27:14.6388990Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:27:14.6389415Z #define FP_INFINITE 1 2025-05-07T20:27:14.6389782Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:14.6390193Z #define _IO_pid_t __pid_t 2025-05-07T20:27:14.6390451Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:27:14.6390720Z #define __LEAF , __leaf__ 2025-05-07T20:27:14.6390965Z #define PATH_MAX 4096 2025-05-07T20:27:14.6391224Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:27:14.6391564Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:27:14.6391895Z #define _LIMITS_H___ 2025-05-07T20:27:14.6392116Z #define __size_t 2025-05-07T20:27:14.6392350Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:27:14.6392889Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | 
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:27:14.6393449Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:27:14.6393760Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:27:14.6394095Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:27:14.6394356Z #define _WCHAR_T_DEFINED 2025-05-07T20:27:14.6394711Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:27:14.6395111Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:27:14.6395476Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:27:14.6395805Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:27:14.6396090Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:27:14.6396373Z #define __INT8_C(c) c 2025-05-07T20:27:14.6396638Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:27:14.6396940Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:27:14.6397204Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:27:14.6397461Z #define __SM_70_RT_HPP__ 2025-05-07T20:27:14.6397715Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:27:14.6397988Z #define __cpp_variadic_using 201611L 2025-05-07T20:27:14.6398308Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:14.6398636Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:27:14.6398909Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:27:14.6399178Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:27:14.6399447Z #define __cpp_capture_star_this 201603L 2025-05-07T20:27:14.6399762Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:27:14.6400065Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:27:14.6400503Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:27:14.6400885Z #define NFDBITS __NFDBITS 2025-05-07T20:27:14.6401146Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:27:14.6401437Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:27:14.6401762Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:27:14.6402083Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:27:14.6402338Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:27:14.6402630Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:27:14.6402937Z #define STA_UNSYNC 0x0040 2025-05-07T20:27:14.6403246Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:27:14.6403661Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:27:14.6404027Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:27:14.6404318Z #define __cpp_if_constexpr 201606L 2025-05-07T20:27:14.6404629Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:27:14.6404961Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:27:14.6405286Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:27:14.6405616Z #define __daddr_t_defined 2025-05-07T20:27:14.6405872Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:27:14.6406266Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:27:14.6406576Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:27:14.6407091Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:27:14.6407578Z #define _ACRTIMP 2025-05-07T20:27:14.6407801Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:27:14.6408074Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:27:14.6408371Z #define _IOS_BIN 128 2025-05-07T20:27:14.6408721Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:27:14.6409130Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:27:14.6409401Z #define UNDERFLOW 4 2025-05-07T20:27:14.6409625Z #define NAME_MAX 255 
2025-05-07T20:27:14.6409858Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:27:14.6410873Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:27:14.6411572Z 2025-05-07T20:27:14.6411677Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:27:14.6411961Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:27:14.6412251Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:27:14.6412632Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:27:14.6413026Z #define __ptr_t void * 2025-05-07T20:27:14.6413262Z #define M_E 2.7182818284590452354 2025-05-07T20:27:14.6413543Z #define cudaSurfaceType1D 0x01 2025-05-07T20:27:14.6413814Z #define __USE_ISOCXX11 1 2025-05-07T20:27:14.6414076Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:27:14.6414394Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:27:14.6414692Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:27:14.6414966Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:27:14.6415263Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:27:14.6415582Z #define cudaSurfaceType2D 0x02 2025-05-07T20:27:14.6415849Z #define __linux 1 2025-05-07T20:27:14.6416092Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:27:14.6416363Z #define cudaDeviceMask 0xff 2025-05-07T20:27:14.6416636Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:27:14.6416934Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:27:14.6417210Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:27:14.6417505Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:27:14.6417820Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:27:14.6418129Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:27:14.6418420Z #define _BITS_TYPES_H 1 2025-05-07T20:27:14.6418711Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:27:14.6419058Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:27:14.6419357Z #define cudaSurfaceType3D 0x03 2025-05-07T20:27:14.6419636Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:27:14.6420015Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:27:14.6420303Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:27:14.6421085Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:27:14.6421905Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:27:14.6471827Z #define __unix 1 2025-05-07T20:27:14.6472123Z #define MATH_ERRNO 1 2025-05-07T20:27:14.6472371Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:27:14.6472653Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:27:14.6472913Z #define __SM_100_RT_H__ 2025-05-07T20:27:14.6473174Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:27:14.6473469Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:27:14.6473757Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:27:14.6474038Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:27:14.6474336Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:27:14.6474812Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:27:14.6475622Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:27:14.6475918Z #define CUDARTAPI_CDECL 2025-05-07T20:27:14.6476186Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:27:14.6476457Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:27:14.6476732Z #define __cpp_lib_void_t 
201411 2025-05-07T20:27:14.6476989Z #define _POSIX_AIO_MAX 1 2025-05-07T20:27:14.6477220Z #define __SIZE_T 2025-05-07T20:27:14.6477464Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:27:14.6477780Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 0 2025-05-07T20:27:14.6478070Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:27:14.6478328Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:27:14.6478585Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:27:14.6478848Z #define _ATFILE_SOURCE 1 2025-05-07T20:27:14.6479230Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:27:14.6479663Z #define __WAIT_STATUS void * 2025-05-07T20:27:14.6479926Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:27:14.6480194Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:27:14.6480463Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:27:14.6480751Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:27:14.6481029Z #define __WINT_MIN__ 0U 2025-05-07T20:27:14.6481596Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:27:14.6482240Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:27:14.6482542Z #define WUNTRACED 2 2025-05-07T20:27:14.6482773Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:27:14.6483045Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:27:14.6483336Z #define NZERO 20 2025-05-07T20:27:14.6483564Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:27:14.6483864Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:27:14.6484187Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:27:14.6484477Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:27:14.6484728Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:27:14.6485016Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:27:14.6485290Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:27:14.6485561Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:27:14.6485837Z #define EXIT_FAILURE 1 2025-05-07T20:27:14.6486078Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:27:14.6486332Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:27:14.6486598Z #define _SIZE_T_DEFINED_ 2025-05-07T20:27:14.6486848Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:27:14.6487127Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:27:14.6487460Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:27:14.6487819Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:27:14.6488114Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:27:14.6488363Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:27:14.6488635Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:27:14.6489080Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:27:14.6489383Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:27:14.6489677Z #define SEEK_DATA 3 2025-05-07T20:27:14.6489908Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:27:14.6490198Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:27:14.6490617Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:27:14.6491004Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:27:14.6491254Z #define __INT64_C(c) c ## L 2025-05-07T20:27:14.6491517Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:27:14.6491849Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:27:14.6492169Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:27:14.6492440Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:27:14.6492733Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:27:14.6493030Z #define 
STA_PPSWANDER 0x0400 2025-05-07T20:27:14.6493280Z #define __INT_WCHAR_T_H 2025-05-07T20:27:14.6493519Z #define WSTOPPED 2 2025-05-07T20:27:14.6493778Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:27:14.6494093Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:27:14.6494428Z #define FP_NORMAL 4 2025-05-07T20:27:14.6494670Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:27:14.6494949Z #define _BITS_TIMEX_H 1 2025-05-07T20:27:14.6495186Z #define _POSIX_LINK_MAX 8 2025-05-07T20:27:14.6495443Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:27:14.6495720Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:27:14.6495992Z #define cudaTextureType1D 0x01 2025-05-07T20:27:14.6496269Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:27:14.6496535Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:27:14.6496798Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:27:14.6497094Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:27:14.6497520Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:27:14.6497964Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:27:14.6498226Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:27:14.6498493Z #define _POSIX_SOURCE 1 2025-05-07T20:27:14.6498736Z #define cudaTextureType2D 0x02 2025-05-07T20:27:14.6498996Z #define _PTR_TRAITS_H 1 2025-05-07T20:27:14.6499268Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:27:14.6499576Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:27:14.6499842Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:27:14.6500163Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:27:14.6500493Z #define cudaTextureType3D 0x03 2025-05-07T20:27:14.6500756Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:27:14.6501014Z #define CLOCK_REALTIME 0 2025-05-07T20:27:14.6501260Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:27:14.6501527Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:27:14.6501831Z #define __cpp_aligned_new 201606L 2025-05-07T20:27:14.6502108Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:27:14.6502385Z #define cudaEventBlockingSync 0x01 2025-05-07T20:27:14.6502671Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:27:14.6502948Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:27:14.6503251Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:27:14.6503549Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:27:14.6503835Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:27:14.6504084Z #define __GLIBC__ 2 2025-05-07T20:27:14.6504307Z #define __END_DECLS } 2025-05-07T20:27:14.6504548Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:27:14.6504907Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:27:14.6505280Z #define __CONCAT(x,y) x ## y 2025-05-07T20:27:14.6505531Z #define WCONTINUED 8 2025-05-07T20:27:14.6505764Z #define __STDC_HOSTED__ 1 2025-05-07T20:27:14.6506014Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:27:14.6506284Z #define _ALLOCA_H 1 2025-05-07T20:27:14.6506520Z #define __host__ __location__(host) 2025-05-07T20:27:14.6506941Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:27:14.6507382Z #define __SLONG32_TYPE int 2025-05-07T20:27:14.6507805Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:27:14.6508090Z #define _SYS_SELECT_H 1 2025-05-07T20:27:14.6508338Z #define _IO_LINE_BUF 0x200 2025-05-07T20:27:14.6508593Z #define _IOS_NOCREATE 32 2025-05-07T20:27:14.6508838Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:27:14.6509113Z #define __cudaGet_warpSize() warpSize 
2025-05-07T20:27:14.6509411Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:27:14.6509703Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:27:14.6509987Z #define __global__ __location__(global) 2025-05-07T20:27:14.6510274Z #define __GNU_LIBRARY__ 6 2025-05-07T20:27:14.6510533Z #define __cpp_decltype_auto 201304L 2025-05-07T20:27:14.6510805Z #define __DBL_DIG__ 15 2025-05-07T20:27:14.6511031Z #define TIME_UTC 1 2025-05-07T20:27:14.6511249Z #define __FLT32_DIG__ 6 2025-05-07T20:27:14.6511563Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:27:14.6511956Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:27:14.6512280Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:27:14.6512581Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:27:14.6512884Z #define _G_BUFSIZ 8192 2025-05-07T20:27:14.6513273Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:27:14.6513640Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:27:14.6513983Z #define __cudaCDP2GetDevice 2025-05-07T20:27:14.6514265Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:27:14.6514551Z #define STA_CLOCKERR 0x1000 2025-05-07T20:27:14.6514794Z #define __GXX_WEAK__ 1 2025-05-07T20:27:14.6515050Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:14.6515353Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:27:14.6515656Z #define __SHRT_WIDTH__ 16 2025-05-07T20:27:14.6515951Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:27:14.6516291Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:27:14.6516564Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:27:14.6516847Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:27:14.6517148Z #define _G_config_h 1 2025-05-07T20:27:14.6517417Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:27:14.6517759Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:27:14.6518037Z #define _GCC_WCHAR_T 2025-05-07T20:27:14.6518268Z #define TMP_MAX 238328 2025-05-07T20:27:14.6518505Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:27:14.6518767Z #define __DEVICE_TYPES_H__ 2025-05-07T20:27:14.6519025Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:14.6519295Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:27:14.6519567Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:27:14.6519852Z #define _IO_SKIPWS 01 2025-05-07T20:27:14.6520246Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:27:14.6520702Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:27:14.6520965Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:27:14.6521289Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:27:14.6521658Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:27:14.6522024Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:27:14.6522379Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:27:14.6522638Z #define le32toh(x) (x) 2025-05-07T20:27:14.6522875Z #define _SIZE_T_DEFINED 2025-05-07T20:27:14.6523127Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:27:14.6523458Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:27:14.6523811Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:27:14.6524256Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:27:14.6524664Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:27:14.6524930Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:27:14.6525197Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:27:14.6525460Z #define _POSIX_NAME_MAX 14 
2025-05-07T20:27:14.6525737Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:27:14.6526347Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:27:14.6526849Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:27:14.6527153Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:27:14.6527504Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:27:14.6527821Z #define _WCHAR_T_ 2025-05-07T20:27:14.6528046Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:27:14.6528407Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:27:14.6528796Z #define RTSIG_MAX 32 2025-05-07T20:27:14.6529017Z #define _STDDEF_H 2025-05-07T20:27:14.6529247Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:27:14.6529516Z #define _VA_LIST_DEFINED 2025-05-07T20:27:14.6529764Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:27:14.6530097Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:27:14.6530488Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:27:14.6530810Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:27:14.6531110Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:27:14.6531570Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:27:14.6532176Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:27:14.6532539Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:27:14.6532858Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:27:14.6533172Z #define __unix__ 1 2025-05-07T20:27:14.6533402Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:14.6533684Z #define __INT_WIDTH__ 32 2025-05-07T20:27:14.6533971Z #define __SIZEOF_LONG__ 8 2025-05-07T20:27:14.6534212Z #define _IONBF 2 2025-05-07T20:27:14.6534661Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:27:14.6535423Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:27:14.6535958Z #define __STDC_IEC_559__ 1 2025-05-07T20:27:14.6536213Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:27:14.6536479Z #define __UINT16_C(c) c 2025-05-07T20:27:14.6536722Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:27:14.6536997Z #define STA_DEL 0x0020 2025-05-07T20:27:14.6537242Z #define __CUDACC_VER_MINOR__ 8 2025-05-07T20:27:14.6537498Z #define __id_t_defined 2025-05-07T20:27:14.6537766Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:27:14.6538216Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:27:14.6538643Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:27:14.6538913Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:27:14.6539168Z #define __DECIMAL_DIG__ 21 2025-05-07T20:27:14.6539423Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:27:14.6539693Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:27:14.6539952Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:27:14.6540217Z #define SING 2 2025-05-07T20:27:14.6540435Z #define STA_FREQHOLD 0x0080 2025-05-07T20:27:14.6540702Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:14.6541005Z #define cudaStreamDefault 0x00 2025-05-07T20:27:14.6541356Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:27:14.6541727Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:27:14.6541995Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:27:14.6542266Z #define __gnu_linux__ 1 2025-05-07T20:27:14.6542503Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:27:14.6542753Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:27:14.6543042Z #define MAX_INPUT 255 2025-05-07T20:27:14.6543287Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:27:14.6543608Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:27:14.6543978Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:27:14.6544293Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:27:14.6544558Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:27:14.6544954Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:27:14.6545381Z #define _IO_SHOWPOS 02000 2025-05-07T20:27:14.6545817Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:27:14.6546184Z #define _Mfloat_ float 2025-05-07T20:27:14.6546455Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:27:14.6546767Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:27:14.6547049Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:27:14.6547371Z #define cudaMemPoolCreateUsageHwDecompress 0x2 2025-05-07T20:27:14.6547913Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:27:14.6548405Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:27:14.6548685Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:27:14.6549014Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:27:14.6549365Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:27:14.6549661Z #define __USE_ISOC11 1 2025-05-07T20:27:14.6549890Z #define _BSD_SIZE_T_ 2025-05-07T20:27:14.6550119Z #define ADJ_MICRO 0x1000 2025-05-07T20:27:14.6550377Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:27:14.6550644Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:27:14.6550946Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:27:14.6551345Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:27:14.6551652Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:27:14.6551978Z #define __THROW throw () 2025-05-07T20:27:14.6552230Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:27:14.6552521Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:14.6552872Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:27:14.6553222Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:27:14.6553497Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:27:14.6553755Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:27:14.6554047Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:27:14.6554331Z #define L_tmpnam 20 2025-05-07T20:27:14.6554554Z #define ___int_wchar_t_h 2025-05-07T20:27:14.6554905Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:27:14.6555283Z #define isascii(c) __isascii (c) 2025-05-07T20:27:14.6555603Z #define _T_PTRDIFF 2025-05-07T20:27:14.6555911Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:27:14.6556266Z #define toascii(c) __toascii (c) 2025-05-07T20:27:14.6556525Z #define __GNUC__ 11 2025-05-07T20:27:14.6556778Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:27:14.6557074Z #define __GXX_RTTI 1 2025-05-07T20:27:14.6557297Z #define __pie__ 2 2025-05-07T20:27:14.6557510Z #define __MMX__ 1 2025-05-07T20:27:14.6557728Z #define __cudaCDP2Malloc 2025-05-07T20:27:14.6557984Z #define __timespec_defined 1 2025-05-07T20:27:14.6558234Z #define L_ctermid 9 2025-05-07T20:27:14.6558462Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:27:14.6558767Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:27:14.6559153Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:27:14.6559521Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:27:14.6559794Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:27:14.6560083Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:27:14.6560386Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:27:14.6560699Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:27:14.6560963Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:27:14.6561397Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:27:14.6562137Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:27:14.6562737Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:27:14.6563044Z #define __USE_SVID 1 2025-05-07T20:27:14.6563291Z #define __constant__ __location__(constant) 2025-05-07T20:27:14.6563606Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:27:14.6563906Z #define __device__ __location__(device) 2025-05-07T20:27:14.6564231Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:27:14.6564640Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:27:14.6564910Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:27:14.6565195Z #define CUDART_DEVICE __device__ 2025-05-07T20:27:14.6565835Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:27:14.6566227Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:27:14.6566511Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:27:14.6566869Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:27:14.6567246Z #define __STDC_UTF_16__ 1 2025-05-07T20:27:14.6567494Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:27:14.6567853Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:27:14.6568276Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:27:14.6568590Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:27:14.6568861Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:27:14.6569123Z #define NGROUPS_MAX 65536 2025-05-07T20:27:14.6569376Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:27:14.6569645Z #define __USE_ISOC95 1 2025-05-07T20:27:14.6569868Z #define _TIME_H 1 2025-05-07T20:27:14.6570132Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:27:14.6570600Z #define __USE_ISOC99 1 2025-05-07T20:27:14.6570917Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:27:14.6571286Z #define HOST_NAME_MAX 64 2025-05-07T20:27:14.6571533Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:27:14.6571784Z #define _IOS_ATEND 4 2025-05-07T20:27:14.6572017Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:27:14.6572339Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:27:14.6572738Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:27:14.6573073Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:27:14.6573355Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:27:14.6573673Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:27:14.6573979Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:27:14.6574241Z #define _STDIO_H 1 2025-05-07T20:27:14.6574631Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:27:14.6575098Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:27:14.6575457Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:27:14.6575832Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:27:14.6576118Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:27:14.6576392Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:27:14.6576663Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:27:14.6576955Z #define __cpp_raw_strings 200710L 2025-05-07T20:27:14.6577251Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:14.6577566Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:27:14.6577838Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:27:14.6578116Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:27:14.6578420Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:27:14.6578696Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:27:14.6578984Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:27:14.6579337Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:27:14.6579710Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:27:14.6579950Z #define __USE_XOPEN 1 2025-05-07T20:27:14.6580194Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:27:14.6580631Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:27:14.6581073Z #define __USE_XOPEN2K 1 2025-05-07T20:27:14.6581314Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:27:14.6581586Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:27:14.6581886Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:27:14.6582153Z #define __cpp_fold_expressions 201603L 2025-05-07T20:27:14.6582667Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:27:14.6583191Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:27:14.6583470Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:27:14.6583995Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:27:14.6584394Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:27:14.6584768Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:27:14.6585161Z #define __END_NAMESPACE_C99 2025-05-07T20:27:14.6585435Z #define __glibcxx_integral_traps true 2025-05-07T20:27:14.6585725Z #define _POSIX_PATH_MAX 256 2025-05-07T20:27:14.6585980Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:27:14.6586236Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:27:14.6586503Z #define _IOS_TRUNC 16 2025-05-07T20:27:14.6586732Z #define _ISOC11_SOURCE 1 2025-05-07T20:27:14.6586985Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:27:14.6587278Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:27:14.6587573Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:27:14.6587934Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:27:14.6588314Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:27:14.6588592Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:27:14.6588855Z #define _IO_UNITBUF 020000 2025-05-07T20:27:14.6589111Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:27:14.6589457Z #define __FD_SETSIZE 1024 2025-05-07T20:27:14.6597012Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:27:14.6597331Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:27:14.6597681Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:27:14.6598043Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:27:14.6598306Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:27:14.6598621Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:27:14.6598944Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:27:14.6599214Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:27:14.6599520Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:27:14.6599852Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:27:14.6600139Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:27:14.6600473Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:27:14.6600765Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:27:14.6601037Z #define __USE_POSIX199506 1 2025-05-07T20:27:14.6601282Z #define _FEATURES_H 1 2025-05-07T20:27:14.6601521Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:27:14.6601905Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:27:14.6602378Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:27:14.6602704Z #define 
__stub_getmsg 2025-05-07T20:27:14.6602930Z #define _IO_FIXED 010000 2025-05-07T20:27:14.6603198Z #define __cpp_lib_addressof_constexpr 201603 2025-05-07T20:27:14.6603509Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:27:14.6603785Z #define __stub_setlogin 2025-05-07T20:27:14.6604059Z #define __stub_fattach 2025-05-07T20:27:14.6604297Z #define __cplusplus 201703L 2025-05-07T20:27:14.6604558Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:27:14.6604839Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:27:14.6605092Z #define INFINITY (__builtin_inff()) 2025-05-07T20:27:14.6605364Z #define _IO_UNBUFFERED 2 2025-05-07T20:27:14.6605842Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:27:14.6606364Z #define _IO_INTERNAL 010 2025-05-07T20:27:14.6606609Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:27:14.6606934Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:27:14.6607283Z #define __dev_t_defined 2025-05-07T20:27:14.6607522Z #define __DEPRECATED 1 2025-05-07T20:27:14.6607750Z #define __S32_TYPE int 2025-05-07T20:27:14.6607998Z #define __cpp_rvalue_references 200610L 2025-05-07T20:27:14.6608290Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:27:14.6608542Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:27:14.6608795Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:27:14.6609389Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:27:14.6610009Z #define _G_HAVE_MREMAP 1 2025-05-07T20:27:14.6610460Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:27:14.6610804Z #define OVERFLOW 3 2025-05-07T20:27:14.6611050Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:27:14.6611357Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:27:14.6611642Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:14.6611972Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:27:14.6612295Z #define __SSE2_MATH__ 1 2025-05-07T20:27:14.6612535Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:27:14.6612839Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:14.6613133Z #define _IO_STDIO_H 2025-05-07T20:27:14.6613372Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:27:14.6613662Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:27:14.6614001Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:27:14.6614317Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:14.6614625Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:27:14.6614887Z #define __amd64 1 2025-05-07T20:27:14.6615104Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:27:14.6615374Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:27:14.6615652Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:27:14.6616022Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:27:14.6616326Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:27:14.6616589Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:27:14.6616881Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:27:14.6617142Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:27:14.6617389Z #define __bounded 2025-05-07T20:27:14.6617606Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:27:14.6617871Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:27:14.6618156Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:27:14.6618429Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:27:14.6618691Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:27:14.6618962Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:14.6619273Z #define __W_STOPCODE(sig) ((sig) 
<< 8 | 0x7f) 2025-05-07T20:27:14.6619686Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:27:14.6620082Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:27:14.6620347Z #define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:27:14.6620685Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:27:14.6621026Z #define STA_PLL 0x0001 2025-05-07T20:27:14.6621268Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:27:14.6621535Z #define __GNUG__ 11 2025-05-07T20:27:14.6621770Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:27:14.6622027Z #define _T_WCHAR 2025-05-07T20:27:14.6622256Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:27:14.6622545Z #define __specialization_static 2025-05-07T20:27:14.6622841Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:27:14.6623146Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:27:14.6623403Z #define cudaArraySparse 0x40 2025-05-07T20:27:14.6623664Z #define STA_PPSFREQ 0x0002 2025-05-07T20:27:14.6623937Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:27:14.6624230Z #define _WCHAR_T 2025-05-07T20:27:14.6624454Z #define __cudaCDP2Free 2025-05-07T20:27:14.6625092Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:27:14.6625773Z #define __cpp_nsdmi 200809L 2025-05-07T20:27:14.6626179Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:27:14.6626614Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:27:14.6626883Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:27:14.6627146Z #define cudaArrayCubemap 0x04 2025-05-07T20:27:14.6627469Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:27:14.6627821Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:27:14.6628054Z #define __NO_CTYPE 1 2025-05-07T20:27:14.6628280Z #define __stub_bdflush 2025-05-07T20:27:14.6628629Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:27:14.6629129Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:27:14.6629429Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:27:14.6629695Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:27:14.6629966Z #define __cpp_initializer_lists 200806L 2025-05-07T20:27:14.6630269Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:27:14.6630565Z #define __U16_TYPE unsigned short int 2025-05-07T20:27:14.6630888Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:27:14.6631231Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:27:14.6631507Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:27:14.6631785Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:27:14.6632116Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:27:14.6632454Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:27:14.6632731Z #define _IO_STDIO 040000 2025-05-07T20:27:14.6633042Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:27:14.6633421Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:27:14.6633741Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:27:14.6634062Z #define _PTRDIFF_T 2025-05-07T20:27:14.6634282Z #define _MOVE_H 1 2025-05-07T20:27:14.6634587Z #define __cpp_hex_float 201603L 2025-05-07T20:27:14.6634838Z #define ADJ_TAI 0x0080 2025-05-07T20:27:14.6635062Z #define __ptrvalue 2025-05-07T20:27:14.6635277Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:27:14.6635577Z 
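The fd_set helpers dumped above (FD_ZERO/__FD_ZERO, __FD_MASK, __FD_ELT, __NFDBITS) implement select(2)'s fixed-size descriptor bitmask. A minimal C++ sketch of their intended use, assuming a POSIX host like this runner; everything beyond the macros themselves is illustrative:

    #include <sys/select.h>
    #include <cstdio>

    int main() {
        fd_set readfds;
        FD_ZERO(&readfds);           // expands to the inline "stosq" loop dumped above
        FD_SET(0, &readfds);         // sets bit (0 % __NFDBITS) of word __FD_ELT(0)
        struct timeval tv = {5, 0};  // 5-second timeout
        int ready = select(1, &readfds, nullptr, nullptr, &tv);
        if (ready > 0 && FD_ISSET(0, &readfds))
            std::puts("stdin is readable");
        return 0;
    }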
#define __GXX_ABI_VERSION 1016 2025-05-07T20:27:14.6635853Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:27:14.6636148Z #define MATH_ERREXCEPT 2 2025-05-07T20:27:14.6636394Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:27:14.6636668Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:27:14.6637055Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:27:14.6637430Z #define __USE_GNU 1 2025-05-07T20:27:14.6637655Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:27:14.6637924Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:27:14.6638190Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:27:14.6638571Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:27:14.6638966Z #define WEXITED 4 2025-05-07T20:27:14.6639180Z #define _IO_NO_READS 4 2025-05-07T20:27:14.6639472Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:27:14.6639817Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:27:14.6640094Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:27:14.6640390Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:27:14.6640696Z #define __uid_t_defined 2025-05-07T20:27:14.6640945Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:27:14.6641231Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:27:14.6641496Z #define WNOHANG 1 2025-05-07T20:27:14.6641739Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:27:14.6642043Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:27:14.6642311Z #define cudaEventDefault 0x00 2025-05-07T20:27:14.6642606Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:27:14.6642924Z #define NL_SETMAX INT_MAX 2025-05-07T20:27:14.6643151Z #define __x86_64 1 2025-05-07T20:27:14.6643379Z #define __cudaCDP2LaunchDevice 2025-05-07T20:27:14.6643778Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:14.6644287Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:27:14.6644785Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:27:14.6645212Z #define __PTRDIFF_T 2025-05-07T20:27:14.6645532Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:27:14.6645900Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:27:14.6646170Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:14.6646456Z #define _Mlong_double_ long double 2025-05-07T20:27:14.6646732Z #define __cpp_lambdas 200907L 2025-05-07T20:27:14.6646979Z #define _IO_DEC 020 2025-05-07T20:27:14.6647203Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:27:14.6647551Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:27:14.6647840Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:27:14.6648118Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:27:14.6648374Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:27:14.6648669Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:27:14.6648987Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:27:14.6649252Z #define _ANSI_STDDEF_H 2025-05-07T20:27:14.6649510Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:27:14.6649818Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:27:14.6650177Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:27:14.6650551Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:27:14.6650828Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:27:14.6651117Z #define __cpp_template_auto 201606L 2025-05-07T20:27:14.6651466Z #define __DBL_MIN__ 
double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:27:14.6651832Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:27:14.6652105Z #define __key_t_defined 2025-05-07T20:27:14.6652349Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:27:14.6652711Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:27:14.6653258Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:27:14.6653620Z #define __GNUC_VA_LIST 2025-05-07T20:27:14.6653947Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:27:14.6654325Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:27:14.6654586Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:27:14.6654856Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:27:14.6655144Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:27:14.6655389Z #define __WCOREFLAG 0x80 2025-05-07T20:27:14.6655633Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:27:14.6655939Z #define cudaEventDisableTiming 0x02 2025-05-07T20:27:14.6656215Z #define __LP64__ 1 2025-05-07T20:27:14.6656456Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:27:14.6656776Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:27:14.6657055Z #define _IO_off64_t __off64_t 2025-05-07T20:27:14.6657309Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:27:14.6657574Z #define __time_t_defined 1 2025-05-07T20:27:14.6657821Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:27:14.6658160Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:27:14.6658522Z #define __USE_UNIX98 1 2025-05-07T20:27:14.6658768Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:27:14.6659029Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:27:14.6659296Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:27:14.6659592Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:27:14.6659898Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:27:14.6660154Z #define SEEK_CUR 1 2025-05-07T20:27:14.6660380Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:14.6660647Z #define _ASSERT_H 1 2025-05-07T20:27:14.6661211Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:27:14.6661832Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:27:14.6662110Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:27:14.6662354Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:27:14.6662623Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:27:14.6662892Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:27:14.6663256Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:27:14.6663659Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:27:14.6664362Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:27:14.6665006Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:27:14.6665293Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:27:14.6665956Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:27:14.6667778Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:27:14.6668055Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:27:14.6668333Z #define cudaArrayDefault 0x00 2025-05-07T20:27:14.6668615Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:27:14.6668900Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:27:14.6669174Z #define TLOSS 5 2025-05-07T20:27:14.6669387Z #define __ssize_t_defined 2025-05-07T20:27:14.6669630Z #define __CUDACC_VER_BUILD__ 61 2025-05-07T20:27:14.6669897Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:27:14.6670183Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:27:14.6670460Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:27:14.6670733Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:27:14.6671014Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:27:14.6671318Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:27:14.6671604Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:27:14.6671887Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:27:14.6672164Z #define __REGISTER_PREFIX__ 2025-05-07T20:27:14.6672422Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:27:14.6672748Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:27:14.6673235Z #define _IOS_NOREPLACE 64 2025-05-07T20:27:14.6673466Z #define __cdecl 2025-05-07T20:27:14.6673700Z #define cudaEventInterprocess 0x04 2025-05-07T20:27:14.6674021Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:27:14.6674342Z #define LOGIN_NAME_MAX 256 2025-05-07T20:27:14.6674586Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:27:14.6674849Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:27:14.6675141Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:27:14.6675457Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:27:14.6675761Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:27:14.6676086Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:27:14.6676481Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:27:14.6676906Z #define ADJ_NANO 0x2000 2025-05-07T20:27:14.6677209Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:27:14.6677559Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:27:14.6677848Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:27:14.6678107Z #define __FLT_DIG__ 6 2025-05-07T20:27:14.6678447Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:27:14.6678834Z #define __NO_INLINE__ 1 2025-05-07T20:27:14.6679132Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:27:14.6679475Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:27:14.6679725Z #define ADJ_STATUS 0x0010 2025-05-07T20:27:14.6679983Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:27:14.6680266Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:27:14.6680527Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:27:14.6680820Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:27:14.6681103Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:27:14.6681477Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 2025-05-07T20:27:14.6681889Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:27:14.6682229Z 
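The __isleap macro from <time.h>, dumped a few entries back, encodes the Gregorian leap-year rule: divisible by 4, excluding centuries unless divisible by 400. An equivalent constexpr restatement, checked at compile time (the function name is illustrative, not glibc's):

    constexpr bool is_leap(int year) {
        // same expression as __isleap(year)
        return year % 4 == 0 && (year % 100 != 0 || year % 400 == 0);
    }

    static_assert(is_leap(2000), "divisible by 400");
    static_assert(!is_leap(1900), "century not divisible by 400");
    static_assert(is_leap(2024), "divisible by 4");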
#define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:27:14.6682578Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:27:14.6682814Z #define MAX_CANON 255 2025-05-07T20:27:14.6683043Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:27:14.6683294Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:27:14.6683555Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:27:14.6683856Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:27:14.6684189Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:27:14.6684478Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:27:14.6684749Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:27:14.6685067Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:27:14.6685371Z #define __VERSION__ "11.4.0" 2025-05-07T20:27:14.6685626Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:27:14.6685914Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:27:14.6686305Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:27:14.6686582Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:27:14.6686891Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:27:14.6687190Z #define __UINT64_C(c) c ## UL 2025-05-07T20:27:14.6687441Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:27:14.6687688Z #define _SYS_TYPES_H 1 2025-05-07T20:27:14.6687918Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:27:14.6688168Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:27:14.6688414Z #define _SYS_CDEFS_H 1 2025-05-07T20:27:14.6688645Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:27:14.6688910Z #define __cpp_unicode_characters 201411L 2025-05-07T20:27:14.6689197Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:27:14.6689449Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:27:14.6689733Z #define __cudaCDP2StreamDestroy 2025-05-07T20:27:14.6689996Z #define FP_SUBNORMAL 3 2025-05-07T20:27:14.6690242Z #define cudaOccupancyDefault 0x00 2025-05-07T20:27:14.6690513Z #define _INITIALIZER_LIST 2025-05-07T20:27:14.6690763Z #define _STDC_PREDEF_H 1 2025-05-07T20:27:14.6691015Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:27:14.6691296Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:27:14.6691628Z #define _IO_file_flags _flags 2025-05-07T20:27:14.6691883Z #define __USE_XOPEN2K8 1 2025-05-07T20:27:14.6692126Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:27:14.6692396Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:27:14.6692669Z #define HUGE 3.40282347e+38F 2025-05-07T20:27:14.6692929Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:27:14.6693296Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:27:14.6693682Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:27:14.6693987Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:27:14.6694249Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:27:14.6694501Z #define _BSD_SOURCE 1 2025-05-07T20:27:14.6694737Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:27:14.6695563Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template<typename _Tp, typename = __void_t<>> struct __has_ ##_NTYPE : false_type { }; template<typename _Tp> struct __has_ ##_NTYPE<_Tp, __void_t<typename _Tp::_NTYPE>> : true_type { }; 2025-05-07T20:27:14.6696393Z #define __catch(X) catch(X) 2025-05-07T20:27:14.6696648Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:27:14.6696934Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:27:14.6697202Z #define __TIMER_T_TYPE void * 2025-05-07T20:27:14.6697447Z #define __STRING(x) #x 2025-05-07T20:27:14.6697681Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:27:14.6697945Z #define _T_PTRDIFF_ 2025-05-07T20:27:14.6698188Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:27:14.6698487Z
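The _GLIBCXX_HAS_NESTED_TYPE macro just above is libstdc++'s member-type detection idiom: a primary template defaulting to false_type, plus a partial specialization selected via __void_t when the nested type exists. A standalone sketch of the same idiom using std::void_t (illustrative names, not libstdc++ internals):

    #include <type_traits>

    template<typename, typename = void>
    struct has_value_type : std::false_type { };        // primary: no nested type

    template<typename T>                                 // chosen when T::value_type is well-formed
    struct has_value_type<T, std::void_t<typename T::value_type>> : std::true_type { };

    struct A { using value_type = int; };
    struct B { };

    static_assert(has_value_type<A>::value, "A exposes value_type");
    static_assert(!has_value_type<B>::value, "B does not");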
#define cudaEventWaitExternal 0x01 2025-05-07T20:27:14.6698752Z #define __unbounded 2025-05-07T20:27:14.6698987Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:14.6699267Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:27:14.6699537Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:14.6699824Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:27:14.6700099Z #define __cpp_lib_is_final 201402L 2025-05-07T20:27:14.6700388Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:27:14.6700710Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:27:14.6701009Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:27:14.6701287Z #define __managed__ __location__(managed) 2025-05-07T20:27:14.6701578Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:27:14.6701976Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:27:14.6702391Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:27:14.6702640Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:27:14.6703004Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:27:14.6703401Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:27:14.6703646Z #define _SYS_SIZE_T_H 2025-05-07T20:27:14.6703953Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:27:14.6704312Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:27:14.6704666Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:27:14.6704953Z #define _CRTIMP 2025-05-07T20:27:14.6705175Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:27:14.6705477Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:27:14.6705795Z #define STA_PPSJITTER 0x0200 2025-05-07T20:27:14.6706139Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:27:14.6706541Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:14.6706850Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:27:14.6707121Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:27:14.6707405Z #define __SIZE_T__ 2025-05-07T20:27:14.6707610Z #define __stub_gtty 2025-05-07T20:27:14.6707840Z #define __pid_t_defined 2025-05-07T20:27:14.6708093Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:27:14.6708383Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:14.6708689Z #define __glibcxx_function_requires(...) 
2025-05-07T20:27:14.6708976Z #define __SM_80_RT_HPP__ 2025-05-07T20:27:14.6709218Z #define __need_clockid_t 2025-05-07T20:27:14.6709454Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:27:14.6709708Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:27:14.6710107Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:27:14.6710414Z #define _IO_HEX 0100 2025-05-07T20:27:14.6710670Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:27:14.6710997Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:27:14.6711095Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:27:14.6711195Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:27:14.6711415Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:14.6711531Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:27:14.6711640Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:27:14.6711742Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:27:14.6711847Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:27:14.6711952Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:27:14.6712038Z #define __stub_sstk 2025-05-07T20:27:14.6712130Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:27:14.6712286Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:27:14.6712373Z #define __wur 2025-05-07T20:27:14.6712489Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:27:14.6712581Z #define _G_HAVE_MMAP 1 2025-05-07T20:27:14.6712663Z #define _IO_OCT 040 2025-05-07T20:27:14.6712755Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:27:14.6712848Z #define NL_MSGMAX INT_MAX 2025-05-07T20:27:14.6712939Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:27:14.6713068Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:27:14.6713158Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:27:14.6713260Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:27:14.6713450Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:27:14.6713601Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:27:14.6713762Z #define _STL_ALGOBASE_H 1 2025-05-07T20:27:14.6713943Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:27:14.6718361Z #define __off64_t_defined 2025-05-07T20:27:14.6718484Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:27:14.6718587Z #define __FLT128_DIG__ 33 2025-05-07T20:27:14.6718697Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:27:14.6718794Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:27:14.6718883Z #define __INT32_C(c) c 2025-05-07T20:27:14.6718978Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:27:14.6719074Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:27:14.6719177Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:27:14.6719273Z #define __PDP_ENDIAN 3412 2025-05-07T20:27:14.6719360Z #define _ISOC95_SOURCE 1 2025-05-07T20:27:14.6719463Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:27:14.6719594Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:27:14.6719689Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:27:14.6719783Z #define __SM_90_RT_HPP__ 2025-05-07T20:27:14.6719880Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:27:14.6720083Z #define __have_pthread_attr_t 1 2025-05-07T20:27:14.6720186Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:27:14.6720412Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:27:14.6720528Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:27:14.6720636Z #define __cudaCDP2EventRecord 2025-05-07T20:27:14.6720732Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:27:14.6720816Z #define 
htole32(x) (x) 2025-05-07T20:27:14.6721068Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:27:14.6721185Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:27:14.6721286Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:27:14.6721443Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:27:14.6721580Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:27:14.6721701Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:27:14.6721838Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:27:14.6721936Z #define ADJ_OFFSET 0x0001 2025-05-07T20:27:14.6722040Z #define cudaArrayLayered 0x01 2025-05-07T20:27:14.6722203Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:27:14.6722391Z #define cudaEventRecordDefault 0x00 2025-05-07T20:27:14.6722491Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:27:14.6722589Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:27:14.6722669Z #define unix 1 2025-05-07T20:27:14.6722765Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:27:14.6722855Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:27:14.6722948Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:27:14.6723067Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:27:14.6723151Z #define __USE_POSIX 1 2025-05-07T20:27:14.6723243Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:27:14.6723375Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:27:14.6723465Z #define __THROWNL throw () 2025-05-07T20:27:14.6723567Z #define __cpp_rtti 199711L 2025-05-07T20:27:14.6723674Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:27:14.6723762Z #define __PMT(args) args 2025-05-07T20:27:14.6723876Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:14.6724027Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:27:14.6724136Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:27:14.6724228Z #define _SIZE_T_DECLARED 2025-05-07T20:27:14.6724325Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:27:14.6724416Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:27:14.6724803Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:27:14.6724901Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:27:14.6724995Z #define XATTR_LIST_MAX 65536 2025-05-07T20:27:14.6725088Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:27:14.6725227Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:27:14.6725313Z #define _WCHAR_T_H 2025-05-07T20:27:14.6725401Z #define __FLT64X_DIG__ 18 2025-05-07T20:27:14.6725494Z #define _IO_SHOWBASE 0200 2025-05-07T20:27:14.6725587Z #define _POSIX_QLIMIT 1 2025-05-07T20:27:14.6725685Z #define __INT8_TYPE__ signed char 2025-05-07T20:27:14.6725789Z #define __SURFACE_TYPES_H__ 2025-05-07T20:27:14.6725879Z #define __CUDA_ARCH__ 520 2025-05-07T20:27:14.6725984Z #define __cpp_digit_separators 201309L 2025-05-07T20:27:14.6726066Z #define __ELF__ 1 2025-05-07T20:27:14.6726170Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:27:14.6726267Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:27:14.6726356Z #define STA_INS 0x0010 2025-05-07T20:27:14.6726453Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:27:14.6726619Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:27:14.6726716Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:27:14.6726811Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:27:14.6726920Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
2025-05-07T20:27:14.6727029Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:27:14.6727210Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:27:14.6727313Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:27:14.6727415Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:27:14.6727570Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:27:14.6727724Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:27:14.6727822Z #define _IO_funlockfile(_fp) 2025-05-07T20:27:14.6728140Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:27:14.6728270Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:27:14.6728365Z #define __DRIVER_TYPES_H__ 2025-05-07T20:27:14.6728451Z #define __FLT_RADIX__ 2 2025-05-07T20:27:14.6728555Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:27:14.6728715Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:27:14.6728809Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:27:14.6728907Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:27:14.6729013Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:27:14.6729109Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:27:14.6729205Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:27:14.6729386Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:27:14.6729472Z #define WORD_BIT 32 2025-05-07T20:27:14.6729556Z #define _IO_USER_BUF 1 2025-05-07T20:27:14.6729647Z #define __VECTOR_TYPES_H__ 2025-05-07T20:27:14.6729751Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:14.6729859Z #define cudaHostAllocPortable 0x01 2025-05-07T20:27:14.6729957Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:27:14.6730058Z #define __long_double_t long double 2025-05-07T20:27:14.6730152Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:27:14.6730242Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:27:14.6730636Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:27:14.6730719Z #define __k8 1 2025-05-07T20:27:14.6730919Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:27:14.6731084Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:27:14.6731200Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:27:14.6731308Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:27:14.6731403Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:27:14.6731502Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:27:14.6731598Z #define __blksize_t_defined 2025-05-07T20:27:14.6731691Z #define _IO_SHOWPOINT 0400 2025-05-07T20:27:14.6731787Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:27:14.6731900Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:27:14.6731993Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:27:14.6732102Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:27:14.6732195Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:27:14.6732290Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:27:14.6732544Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:27:14.6732883Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:27:14.6732984Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:27:14.6733087Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:27:14.6733170Z #define SEEK_SET 0 2025-05-07T20:27:14.6733267Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:27:14.6733363Z #define 
__CUDA_API_VER_MINOR__ 8 2025-05-07T20:27:14.6733548Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:27:14.6733651Z #define __cudaCDP2GetLastError 2025-05-07T20:27:14.6733745Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:27:14.6733833Z #define _MATH_H_MATHDEF 1 2025-05-07T20:27:14.6734173Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:27:14.6734290Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:27:14.6734387Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:27:14.6734478Z #define __stub_sigreturn 2025-05-07T20:27:14.6734789Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:27:14.6734886Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:27:14.6734983Z #define __HOST_CONFIG_H__ 2025-05-07T20:27:14.6735082Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:27:14.6735168Z #define CLOCK_TAI 11 2025-05-07T20:27:14.6735271Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:27:14.6735474Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:27:14.6735565Z #define __restrict_arr 2025-05-07T20:27:14.6735674Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:27:14.6735811Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:27:14.6736326Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:27:14.6736511Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:27:14.6736601Z #define __USE_MISC 1 2025-05-07T20:27:14.6736707Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:27:14.6736883Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:27:14.6736973Z #define _GCC_LIMITS_H_ 2025-05-07T20:27:14.6737059Z #define __LDBL_DIG__ 18 2025-05-07T20:27:14.6737154Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:27:14.6737259Z #define __malloc_and_calloc_defined 2025-05-07T20:27:14.6737350Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:27:14.6737452Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:27:14.6737538Z #define __x86_64__ 1 2025-05-07T20:27:14.6737618Z #define _SIZE_T_ 2025-05-07T20:27:14.6738493Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:27:14.6738593Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:27:14.6738688Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:27:14.6738810Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:27:14.6738923Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:27:14.6739016Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:27:14.6739124Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:27:14.6739243Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:27:14.6739380Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:27:14.6739476Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:27:14.6739927Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy 
(__new, __old, __len); })) 2025-05-07T20:27:14.6740055Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:27:14.6740201Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:27:14.6740300Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:27:14.6740396Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:27:14.6740486Z #define STA_FLL 0x0008 2025-05-07T20:27:14.6740624Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:27:14.6740722Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:27:14.6740841Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:14.6740950Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:27:14.6741036Z #define __stub_revoke 2025-05-07T20:27:14.6741128Z #define __timer_t_defined 1 2025-05-07T20:27:14.6741261Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:27:14.6741351Z #define INT_MAX __INT_MAX__ 2025-05-07T20:27:14.6741453Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:27:14.6741561Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:27:14.6741656Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:27:14.6741756Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:27:14.6741968Z #define cudaArrayTextureGather 0x08 2025-05-07T20:27:14.6742068Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:27:14.6742212Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:27:14.6742312Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:27:14.6742400Z #define _IO_off_t __off_t 2025-05-07T20:27:14.6742490Z #define __FLT64_DIG__ 15 2025-05-07T20:27:14.6742704Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:27:14.6742799Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:27:14.6742929Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:14.6743047Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:27:14.6743143Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:27:14.6743247Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:27:14.6743330Z #define NULL __null 2025-05-07T20:27:14.6743458Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:27:14.6743563Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:27:14.6743666Z #define __U64_TYPE unsigned long int 2025-05-07T20:27:14.6743763Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:27:14.6743860Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:27:14.6744048Z #define FP_ZERO 2 2025-05-07T20:27:14.6744167Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:27:14.6744315Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:27:14.6744421Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:14.6744508Z #define __WCHAR_T__ 2025-05-07T20:27:14.6744600Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:27:14.6744792Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:27:14.6744942Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:27:14.6745037Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:27:14.6745156Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:27:14.6745269Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:14.6745394Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:27:14.6745524Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:27:14.6745617Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:27:14.6745707Z #define _SIGSET_H_types 1 2025-05-07T20:27:14.6745824Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:27:14.6745926Z #define __cpp_unicode_literals 200710L 2025-05-07T20:27:14.6746070Z 
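The <endian.h> macros scattered through this dump (htobe64, be32toh, htole32, and the __bswap_constant_* helpers) are identities or byte swaps depending on host byte order; on this little-endian x86_64 runner the le variants are identities. A hedged round-trip sketch, assuming glibc's htobe32/be32toh pair (htobe32 is not shown in this excerpt but is defined alongside htobe64):

    #include <endian.h>
    #include <cstdint>
    #include <cstdio>

    int main() {
        std::uint32_t wire = htobe32(0x01020304u);  // convert to big-endian wire order
        std::printf("%08x\n", be32toh(wire));       // round-trips: prints 01020304
        return 0;
    }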
#define __isdigit_l(c,l) __isctype_l((c), _ISdigit, (l)) 2025-05-07T20:27:14.6746173Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:27:14.6746289Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:27:14.6746419Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:27:14.6746524Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:27:14.6746648Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:27:14.6746761Z #define __CUDACC_DEVICE_ATOMIC_BUILTINS__ 1 2025-05-07T20:27:14.6746930Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:27:14.6747025Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:27:14.6747137Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:27:14.6747236Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:27:14.6747324Z #define STA_MODE 0x4000 2025-05-07T20:27:14.6747439Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:27:14.6747539Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:27:14.6747654Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:27:14.6747756Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:27:14.6747852Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:27:14.6747960Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:27:14.6748055Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:27:14.6748164Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:27:14.6748255Z #define __SIZE_WIDTH__ 64 2025-05-07T20:27:14.6748370Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:14.6748451Z #define __SEG_FS 1 2025-05-07T20:27:14.6748540Z #define _IO_size_t size_t 2025-05-07T20:27:14.6748637Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:27:14.6748812Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:27:14.6748903Z #define __stub_lchmod 2025-05-07T20:27:14.6748994Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:27:14.6749109Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:14.6749206Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:27:14.6749287Z #define __SEG_GS 1 2025-05-07T20:27:14.6749468Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:27:14.6749555Z #define _IOS_APPEND 8 2025-05-07T20:27:14.6749648Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:27:14.6749742Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:27:14.6749838Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:27:14.6749933Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:27:14.6750037Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:27:14.6750126Z #define htole16(x) (x) 2025-05-07T20:27:14.6750231Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:27:14.6750329Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:27:14.6750422Z #define __INT16_TYPE__ short int 2025-05-07T20:27:14.6750530Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:27:14.6750635Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:27:14.6750826Z #define __cpp_structured_bindings 201606L 2025-05-07T20:27:14.6750952Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:27:14.6751039Z #define __SIZEOF_INT__ 4 2025-05-07T20:27:14.6751128Z #define __WCLONE 0x80000000 2025-05-07T20:27:14.6751222Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:27:14.6751306Z #define SEEK_HOLE 4 2025-05-07T20:27:14.6751393Z #define TIMER_ABSTIME 1 2025-05-07T20:27:14.6751489Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:27:14.6751580Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:27:14.6751752Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:27:14.6751865Z #define 
__INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:14.6751959Z #define __DRIVER_FUNCTIONS_H__ 2025-05-07T20:27:14.6752072Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:27:14.6752171Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:27:14.6752291Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:27:14.6752382Z #define _LINUX_LIMITS_H 2025-05-07T20:27:14.6752469Z #define linux 1 2025-05-07T20:27:14.6752561Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:27:14.6752673Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:27:14.6752769Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:27:14.6752861Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:27:14.6752970Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:27:14.6753116Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:27:14.6753214Z #define __cpp_lib_hypot 201603 2025-05-07T20:27:14.6753310Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:27:14.6753407Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:27:14.6753499Z #define MOD_NANO ADJ_NANO 2025-05-07T20:27:14.6753583Z #define htole64(x) (x) 2025-05-07T20:27:14.6753682Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:27:14.6753813Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:27:14.6753907Z #define _IO_UPPERCASE 01000 2025-05-07T20:27:14.6754387Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:27:14.6754480Z #define __USE_POSIX2 1 2025-05-07T20:27:14.6754577Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:27:14.6754666Z #define __WALL 0x40000000 2025-05-07T20:27:14.6754766Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:27:14.6754848Z #define _XLOCALE_H 1 2025-05-07T20:27:14.6754947Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:27:14.6755045Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:27:14.6755140Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:27:14.6755247Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:27:14.6755333Z #define __EXCEPTIONS 1 2025-05-07T20:27:14.6755490Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:27:14.6755681Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:27:14.6755768Z #define __WORDSIZE 64 2025-05-07T20:27:14.6755969Z #define CLOCK_MONOTONIC 1 2025-05-07T20:27:14.6756062Z #define _STL_RELOPS_H 1 2025-05-07T20:27:14.6756157Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:27:14.6756259Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:27:14.6756359Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:27:14.6756450Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:27:14.6756550Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:27:14.6756842Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:27:14.6757069Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:27:14.6757188Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:27:14.6757285Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:27:14.6757385Z #define __cpp_range_based_for 201603L 2025-05-07T20:27:14.6757496Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:27:14.6757596Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:27:14.6757706Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:27:14.6757888Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:27:14.6757985Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:27:14.6758160Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:27:14.6758265Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:27:14.6758434Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:27:14.6758549Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:27:14.6758631Z #define _STRING_H 1 2025-05-07T20:27:14.6758731Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:27:14.6758823Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:27:14.6758920Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:27:14.6759051Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:27:14.6759148Z #define __code_model_small__ 1 2025-05-07T20:27:14.6759235Z #define _PSTL_CONFIG_H 2025-05-07T20:27:14.6759340Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:27:14.6759452Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:27:14.6759557Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:27:14.6759660Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:27:14.6759998Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:27:14.6760091Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:27:14.6760181Z #define le64toh(x) (x) 2025-05-07T20:27:14.6760272Z #define FILENAME_MAX 4096 2025-05-07T20:27:14.6760418Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:27:14.6760537Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:27:14.6760619Z #define L_cuserid 9 2025-05-07T20:27:14.6760711Z #define __ino_t_defined 2025-05-07T20:27:14.6760791Z #define __k8__ 1 2025-05-07T20:27:14.6760889Z #define __INTPTR_TYPE__ long int 2025-05-07T20:27:14.6760999Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:27:14.6761087Z #define __int8_t_defined 2025-05-07T20:27:14.6761180Z #define __WCHAR_TYPE__ int 2025-05-07T20:27:14.6761289Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:27:14.6761402Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:27:14.6761498Z #define __SLONGWORD_TYPE long int 2025-05-07T20:27:14.6761624Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:27:14.6761772Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:27:14.6761857Z #define __HAVE_COLUMN 2025-05-07T20:27:14.6761945Z #define __stub_fdetach 2025-05-07T20:27:14.6762344Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:27:14.6762429Z #define __pic__ 2 2025-05-07T20:27:14.6762544Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:14.6762640Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:27:14.6762737Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:27:14.6762839Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:27:14.6762925Z #define __stub_chflags 2025-05-07T20:27:14.6763014Z #define CLOCK_BOOTTIME 7 2025-05-07T20:27:14.6763179Z #define __need_IOV_MAX 2025-05-07T20:27:14.6763286Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:27:14.6763395Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:27:14.6763491Z #define __cpp_decltype 200707L 2025-05-07T20:27:14.6763593Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:27:14.6763689Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:27:14.6763794Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:27:14.6763882Z #define TTY_NAME_MAX 32 2025-05-07T20:27:14.6764043Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:27:14.6764162Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:14.6764329Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:27:14.6764439Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:27:14.6764531Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:27:14.6764627Z #define STA_PPSTIME 0x0004 2025-05-07T20:27:14.6764711Z #define __import__ 2025-05-07T20:27:14.6764806Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:27:14.6764943Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:27:14.6765026Z #define __export__ 2025-05-07T20:27:14.6765228Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:27:14.6765326Z #define cudaMemAttachHost 0x02 2025-05-07T20:27:14.6766422Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:27:14.6766524Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:27:14.6766614Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:27:14.6766716Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:27:14.6766806Z #define _WCHAR_T_DECLARED 2025-05-07T20:27:14.6766924Z #define 
__UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:27:14.6767041Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:27:14.6767144Z #define __cpp_inline_variables 201606L 2025-05-07T20:27:14.6767239Z #define WNOWAIT 0x01000000 2025-05-07T20:27:14.6767327Z #define PLOSS 6 2025-05-07T20:27:14.6767421Z #define M_LN10 2.30258509299404568402 2025-05-07T20:27:14.6767685Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:27:14.6767777Z #define EXIT_SUCCESS 0 2025-05-07T20:27:14.6767881Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:27:14.6767983Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:27:14.6768084Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:27:14.6768173Z #define __thread__ __thread 2025-05-07T20:27:14.6768274Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:27:14.6768366Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:27:14.6768471Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:27:14.6768696Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:27:14.6768809Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:27:14.6768902Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:27:14.6768987Z #define __linux__ 1 2025-05-07T20:27:14.6769082Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:27:14.6769214Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:27:14.6769309Z #define __S16_TYPE short int 2025-05-07T20:27:14.6769647Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:27:14.6769765Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:27:14.6769950Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:27:14.6770049Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:27:14.6770153Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:27:14.6770236Z #define _T_SIZE_ 2025-05-07T20:27:14.6770334Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:27:14.6770453Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:27:14.6770546Z #define _PSTL_VERSION 12000 2025-05-07T20:27:14.6770669Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:27:14.6770764Z #define __WNOTHREAD 0x20000000 2025-05-07T20:27:14.6770860Z #define _G_va_list __gnuc_va_list 2025-05-07T20:27:14.6770991Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:27:14.6771255Z #define _IOS_INPUT 1 2025-05-07T20:27:14.6771350Z #define __USE_LARGEFILE64 1 2025-05-07T20:27:14.6771458Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:27:14.6771555Z #define __INT64_TYPE__ long int 2025-05-07T20:27:14.6771649Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:27:14.6771751Z #define __shared__ __location__(shared) 2025-05-07T20:27:14.6771842Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:27:14.6771995Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:27:14.6772087Z #define __gid_t_defined 2025-05-07T20:27:14.6772196Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:27:14.6772296Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:27:14.6772491Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:27:14.6772588Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:27:14.6772683Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:27:14.6772769Z #define ___int_size_t_h 2025-05-07T20:27:14.6772880Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:14.6773005Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:27:14.6773157Z 
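The assert macro dumped just above evaluates its argument and, on failure, passes the stringized expression (via __STRING), __FILE__, __LINE__, and __ASSERT_FUNCTION to __assert_fail. A minimal usage sketch:

    #include <cassert>

    int main() {
        int denom = 2;
        assert(denom != 0 && "denominator must be nonzero");  // failure calls __assert_fail
        return 10 / denom == 5 ? 0 : 1;
    }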
#define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:27:14.6773376Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:27:14.6773474Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:27:14.6773570Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:27:14.6773666Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:27:14.6773810Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:14.6773930Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:27:14.6774064Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:27:14.6774155Z #define __clock_t_defined 1 2025-05-07T20:27:14.6774253Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:27:14.6774363Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:27:14.6774453Z #define __GLIBC_MINOR__ 17 2025-05-07T20:27:14.6774544Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:27:14.6774644Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:27:14.6774757Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:27:14.6774846Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:27:14.6775024Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:27:14.6775105Z #define __SSE__ 1 2025-05-07T20:27:14.6775202Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:27:14.6775296Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:27:14.6775381Z #define _CTYPE_H 1 2025-05-07T20:27:14.6775479Z #define __sigset_t_defined 2025-05-07T20:27:14.6775574Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:27:14.6775667Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:27:14.6775756Z #define MOD_TAI ADJ_TAI 2025-05-07T20:27:14.6775851Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:27:14.6775944Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:27:14.6776033Z #define __SM_70_RT_H__ 2025-05-07T20:27:14.6776125Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:27:14.6776230Z #define cudaEventWaitDefault 0x00 2025-05-07T20:27:14.6776337Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:27:14.6776494Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:27:14.6776592Z #define _POSIX_MAX_CANON 255 2025-05-07T20:27:14.6776704Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:27:14.6776797Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:27:14.6776891Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:27:14.6776973Z #define __amd64__ 1 2025-05-07T20:27:14.6777061Z #define __WINT_WIDTH__ 32 2025-05-07T20:27:14.6777168Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:27:14.6777433Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:14.6777532Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:27:14.6777617Z #define EOF (-1) 2025-05-07T20:27:14.6777712Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:27:14.6777809Z #define __USE_POSIX199309 1 2025-05-07T20:27:14.6777904Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:27:14.6777997Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:27:14.6778176Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:27:14.6778276Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:27:14.6778386Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:27:14.6778487Z #define ____mbstate_t_defined 1 2025-05-07T20:27:14.6778572Z #define STA_NANO 0x2000 2025-05-07T20:27:14.6778666Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:27:14.6778763Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:27:14.6778849Z #define _IO_LINKED 0x80 2025-05-07T20:27:14.6778946Z #define __cpp_lib_launder 201606 2025-05-07T20:27:14.6779040Z #define __SIZEOF_INT128__ 16 2025-05-07T20:27:14.6779144Z 
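The wait-status macros in this dump (WEXITSTATUS, WIFSTOPPED, WTERMSIG, __WIFCONTINUED) decode the packed status word from wait(2): per the __WEXITSTATUS and __WTERMSIG definitions above, the exit code lives in bits 8-15 and the terminating signal in bits 0-6. A short sketch:

    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        pid_t pid = fork();
        if (pid == 0)
            _exit(7);                          // child exits with status 7
        int status = 0;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status))                 // i.e. __WTERMSIG(status) == 0
            std::printf("child exited with %d\n", WEXITSTATUS(status));  // prints 7
        return 0;
    }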
#define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:27:14.6779240Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:27:14.6779332Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:27:14.6779470Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:27:14.6779580Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:14.6779681Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:27:14.6779781Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:27:14.6779877Z #define __W_CONTINUED 0xffff 2025-05-07T20:27:14.6779966Z #define __ATOMIC_RELAXED 0 2025-05-07T20:27:14.6780172Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:27:14.6780295Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:14.6780496Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:27:14.6780679Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:27:14.6780768Z #define __stub_stty 2025-05-07T20:27:14.6780929Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:27:14.6781019Z #define le16toh(x) (x) 2025-05-07T20:27:14.6781125Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:27:14.6781295Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:27:14.6781379Z #define _SIZET_ 2025-05-07T20:27:14.6781469Z #define XATTR_NAME_MAX 255 2025-05-07T20:27:14.6781553Z #define _SVID_SOURCE 1 2025-05-07T20:27:14.6781643Z #define _LP64 1 2025-05-07T20:27:14.6781734Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:27:14.6781963Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:27:14.6782082Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:27:14.6782166Z #define __UINT8_C(c) c 2025-05-07T20:27:14.6782264Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:27:14.6782358Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:27:14.6782468Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:27:14.6782565Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:27:14.6782659Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:27:14.6782757Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:27:14.6782845Z #define CUDARTAPI 2025-05-07T20:27:14.6782928Z #define IOV_MAX 1024 2025-05-07T20:27:14.6783070Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:27:14.6783175Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:27:14.6783268Z #define P_tmpdir "/tmp" 2025-05-07T20:27:14.6783375Z #define cudaMemAttachSingle 0x04 2025-05-07T20:27:14.6783462Z #define __wchar_t__ 2025-05-07T20:27:14.6783563Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:27:14.6783654Z #define SEEK_END 2 2025-05-07T20:27:14.6783746Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:27:14.6783939Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:27:14.6784063Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:27:14.6784203Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:27:14.6784293Z #define ____FILE_defined 1 2025-05-07T20:27:14.6784410Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:27:14.6784505Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:27:14.6784592Z #define _ISOC99_SOURCE 1 2025-05-07T20:27:14.6784690Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:27:14.6784931Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:14.6785062Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:27:14.6785145Z #define _IO_RIGHT 04 2025-05-07T20:27:14.6785320Z #define __END_NAMESPACE_STD 2025-05-07T20:27:14.6785506Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:27:14.6785603Z #define _GLIBCXX_STD_C std 2025-05-07T20:27:14.6785720Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:27:14.6785816Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:27:14.6785917Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:27:14.6785998Z #define _STDDEF_H_ 2025-05-07T20:27:14.6786170Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:27:14.6786267Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:27:14.6786385Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:27:14.6786577Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:27:14.6786687Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:14.6786827Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:27:14.6786952Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:27:14.6787051Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:27:14.6787163Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:27:14.6787336Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:27:14.6787447Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:27:14.6787549Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:27:14.6787641Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:27:14.6787734Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:27:14.6787905Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:27:14.6787996Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:27:14.6788173Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:27:14.6788271Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:27:14.6788364Z #define __STDCPP_THREADS__ 1 2025-05-07T20:27:14.6788508Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:27:14.6788603Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:27:14.6788701Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:27:14.6788803Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:27:14.6788918Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:27:14.6789015Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:27:14.6789118Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:27:14.6789279Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:27:14.6789450Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:27:14.6789547Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:27:14.6789665Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:27:14.6789777Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:27:14.6789877Z #define __location__(a) __annotate__(a) 2025-05-07T20:27:14.6790098Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:27:14.6790200Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:27:14.6790311Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:27:14.6790410Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:27:14.6790501Z #define __STDC_UTF_32__ 1 2025-05-07T20:27:14.6790594Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:27:14.6790697Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:27:14.6790792Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:27:14.6790873Z #define __FXSR__ 1 2025-05-07T20:27:14.6790957Z #define _SIZE_T 2025-05-07T20:27:14.6791060Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:27:14.6791169Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:27:14.6791335Z #define __FLT32X_MAX__ 
1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:14.6791483Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:27:14.6791575Z #define _IO_ssize_t __ssize_t 2025-05-07T20:27:14.6791675Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:27:14.6791853Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:27:14.6792051Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:27:14.6792225Z #define _GXX_NULLPTR_T 2025-05-07T20:27:14.6792349Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:27:14.6792439Z #define FOPEN_MAX 16 2025-05-07T20:27:14.6792530Z #define __BIG_ENDIAN 4321 2025-05-07T20:27:14.6792644Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:27:14.6792745Z #define __suseconds_t_defined 2025-05-07T20:27:14.6792833Z #define __off_t_defined 2025-05-07T20:27:14.6792918Z #define stderr stderr 2025-05-07T20:27:14.6793015Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:27:14.6793124Z #define __glibcxx_requires_string(_String) 2025-05-07T20:27:14.6793219Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:27:14.6793312Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:27:14.6793712Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:27:14.6793809Z #define __mode_t_defined 2025-05-07T20:27:14.6793913Z #define _GCC_SIZE_T 2025-05-07T20:27:14.6794026Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:14.6794144Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:27:14.6794248Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:27:14.6794444Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:27:14.6794537Z #define __UINT32_C(c) c ## U 2025-05-07T20:27:14.6794638Z #define __cpp_alias_templates 200704L 2025-05-07T20:27:14.6794741Z #define cudaHostAllocMapped 0x02 2025-05-07T20:27:14.6794848Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:27:14.6794937Z #define _STL_ITERATOR_H 1 2025-05-07T20:27:14.6795020Z #define __size_t__ 2025-05-07T20:27:14.6795148Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:27:14.6795242Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:27:14.6795351Z #define cudaEventRecordExternal 0x01 2025-05-07T20:27:14.6795565Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:27:14.6795658Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:27:14.6795825Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:27:14.6795913Z #define _ENDIAN_H 1 2025-05-07T20:27:14.6796016Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:27:14.6796117Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:27:14.6796217Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:27:14.6796299Z #define __try try 2025-05-07T20:27:14.6796394Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:27:14.6796486Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:27:14.6796575Z #define __INT8_MAX__ 0x7f 2025-05-07T20:27:14.6796827Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:27:14.6796915Z #define __LONG_WIDTH__ 64 2025-05-07T20:27:14.6796999Z #define __PIC__ 2 2025-05-07T20:27:14.6797107Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:27:14.6797222Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:27:14.6797353Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:27:14.6797449Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:27:14.6797546Z #define 
_GLIBCXX_HAVE_ATANL 1 2025-05-07T20:27:14.6797732Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:14.6797832Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:27:14.6797938Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:27:14.6798026Z #define _IO_uid_t __uid_t 2025-05-07T20:27:14.6798123Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:27:14.6798250Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:27:14.6798341Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:27:14.6798483Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:27:14.6798586Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:27:14.6798704Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:27:14.6798789Z #define LONG_BIT 64 2025-05-07T20:27:14.6798896Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:27:14.6798994Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:27:14.6799123Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:27:14.6799301Z #define __fsfilcnt_t_defined 2025-05-07T20:27:14.6799394Z #define __blkcnt_t_defined 2025-05-07T20:27:14.6799662Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:27:14.6799755Z #define __USE_LARGEFILE 1 2025-05-07T20:27:14.6799853Z #define __cpp_constexpr 201603L 2025-05-07T20:27:14.6799948Z #define CUDART_VERSION 12080 2025-05-07T20:27:14.6800036Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:27:14.6800136Z #define cudaDeviceMapHost 0x08 2025-05-07T20:27:14.6800226Z #define _GLIBCXX_CMATH 1 2025-05-07T20:27:14.6800418Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:27:14.6800508Z #define __lldiv_t_defined 1 2025-05-07T20:27:14.6800591Z #define __SSE2__ 1 2025-05-07T20:27:14.6800672Z #define _IOLBF 1 2025-05-07T20:27:14.6800774Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:27:14.6800867Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:27:14.6800969Z #define __cpp_deduction_guides 201703L 2025-05-07T20:27:14.6801070Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:27:14.6801178Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:27:14.6801268Z #define __INT32_TYPE__ int 2025-05-07T20:27:14.6801445Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:27:14.6801550Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:27:14.6801649Z #define __cpp_exceptions 199711L 2025-05-07T20:27:14.6801745Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:27:14.6801852Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:27:14.6801942Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:27:14.6802061Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:27:14.6802218Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:27:14.6802315Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:27:14.6802407Z #define __SWORD_TYPE long int 2025-05-07T20:27:14.6802499Z #define __INTMAX_TYPE__ long int 2025-05-07T20:27:14.6802595Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:27:14.6802687Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:27:14.6802782Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:27:14.6803061Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:27:14.6803159Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:27:14.6803301Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:27:14.6803382Z #define _T_SIZE 2025-05-07T20:27:14.6803486Z #define cudaHostAllocDefault 0x00 2025-05-07T20:27:14.6803611Z #define _PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 
2025-05-07T20:27:14.6803757Z #define __va_arg_pack() __builtin_va_arg_pack () 2025-05-07T20:27:14.6803858Z #define _POSIX_TIMER_MAX 32 2025-05-07T20:27:14.6803967Z #define _GLIBCXX_HAVE_TLS 1 2025-05-07T20:27:14.6804084Z #define _GLIBCXX_NOTHROW _GLIBCXX_USE_NOEXCEPT 2025-05-07T20:27:14.6804182Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:27:14.6804275Z #define __ATOMIC_CONSUME 1 2025-05-07T20:27:14.6804447Z #define __CUDA_ARCH_HAS_FEATURE__(_FEAT) __CUDA_ARCH_FEAT_ ##_FEAT 2025-05-07T20:27:14.6804534Z #define __GNUC_MINOR__ 4 2025-05-07T20:27:14.6804642Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:27:14.6804734Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:27:14.6804849Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:14.6804938Z #define __PIE__ 2 2025-05-07T20:27:14.6805039Z #define LITTLE_ENDIAN __LITTLE_ENDIAN 2025-05-07T20:27:14.6805140Z #define _GLIBCXX_HAVE_INT64_T_LONG 1 2025-05-07T20:27:14.6805326Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:27:14.6805541Z #define __intN_t(N,MODE) typedef int int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:27:14.6805635Z #define __nlink_t_defined 2025-05-07T20:27:14.6805760Z #define _GLIBCXX17_DEPRECATED [[__deprecated__]] 2025-05-07T20:27:14.6805868Z #define _PSTL_STRING(x) _PSTL_STRING_AUX(x) 2025-05-07T20:27:14.6805956Z #define _XOPEN_LIM_H 1 2025-05-07T20:27:14.6806209Z #define __u_intN_t(N,MODE) typedef unsigned int u_int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:27:14.6806409Z #define __cpp_template_template_args 201611L 2025-05-07T20:27:14.6806514Z #define _GTHREAD_USE_MUTEX_TIMEDLOCK 1 2025-05-07T20:27:14.6806614Z #define BC_DIM_MAX _POSIX2_BC_DIM_MAX 2025-05-07T20:27:14.6806714Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:27:14.6806802Z #define __FILE_defined 1 2025-05-07T20:27:14.6806976Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:27:14.6807074Z #define _GLIBCXX_HAVE_SINCOS 1 2025-05-07T20:27:14.6807167Z #define __USE_XOPEN_EXTENDED 1 2025-05-07T20:27:14.6807272Z #define __cpp_lib_tuple_element_t 201402L 2025-05-07T20:27:14.6807387Z #define isascii_l(c,l) __isascii_l ((c), (l)) 2025-05-07T20:27:14.6807498Z #define cudaInvalidDeviceId ((int)-2) 2025-05-07T20:27:14.6807598Z #define _GLIBCXX_HAVE_SYS_RESOURCE_H 1 2025-05-07T20:27:14.6807683Z #define __INT16_C(c) c 2025-05-07T20:27:14.6807822Z #define __U32_TYPE unsigned int 2025-05-07T20:27:14.6807952Z #define _GLIBCXX_HAVE_SYS_IOCTL_H 1 2025-05-07T20:27:14.6808106Z #define FD_CLR(fd,fdsetp) __FD_CLR (fd, fdsetp) 2025-05-07T20:27:14.6811908Z #define __STDC__ 1 2025-05-07T20:27:14.6812029Z #define _GLIBCXX_HAVE_VWSCANF 1 2025-05-07T20:27:14.6812135Z #define _GLIBCXX_HAVE_EXECINFO_H 1 2025-05-07T20:27:14.6812343Z #define _GLIBCXX_USE_REALPATH 1 2025-05-07T20:27:14.6812499Z #define __attribute_malloc__ __attribute__ ((__malloc__)) 2025-05-07T20:27:14.6812590Z #define __FLT32X_DIG__ 15 2025-05-07T20:27:14.6812695Z #define _GLIBCXX_USE_C99_CTYPE_TR1 1 2025-05-07T20:27:14.6812793Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:27:14.6812908Z #define cudaArrayDeferredMapping 0x80 2025-05-07T20:27:14.6813022Z #define _GLIBCXX_END_NAMESPACE_CONTAINER 2025-05-07T20:27:14.6813121Z #define USHRT_MAX (SHRT_MAX * 2 + 1) 2025-05-07T20:27:14.6813224Z #define __cpp_lib_is_swappable 201603 2025-05-07T20:27:14.6813313Z #define stdin stdin 2025-05-07T20:27:14.6813405Z #define __ino64_t_defined 2025-05-07T20:27:14.6813494Z #define STA_CLK 0x8000 
2025-05-07T20:27:14.6813588Z #define __clockid_t_defined 1 2025-05-07T20:27:14.6813745Z #define _GLIBCXX_NOEXCEPT_IF(...) noexcept(__VA_ARGS__) 2025-05-07T20:27:14.6813935Z #define __attribute_noinline__ __attribute__ ((__noinline__)) 2025-05-07T20:27:14.6814066Z #define __cudaCDP2MemsetAsync 2025-05-07T20:27:14.6814173Z #define _PSTL_PRAGMA_SIMD_SCAN(PRM) 2025-05-07T20:27:14.6814276Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL 2025-05-07T20:27:14.6814382Z #define _GLIBCXX_TR1_POLY_HERMITE_TCC 1 2025-05-07T20:27:14.6814580Z #define __FD_SET(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] |= __FD_MASK (d))) 2025-05-07T20:27:14.6814673Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:27:14.6815192Z #define __tobody(c,f,a,args) (__extension__ ({ int __res; if (sizeof (c) > 1) { if (__builtin_constant_p (c)) { int __c = (c); __res = __c < -128 || __c > 255 ? __c : (a)[__c]; } else __res = f args; } else __res = (a)[(int) (c)]; __res; })) 2025-05-07T20:27:14.6815276Z #define DOMAIN 1 2025-05-07T20:27:14.6815368Z #define M_LN2 0.69314718055994530942 2025-05-07T20:27:14.6815455Z #define __NVCC__ 1 2025-05-07T20:27:14.6815564Z #define __cudaCDP2Memset2DAsync 2025-05-07T20:27:14.6815680Z #define __CLOCK_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:14.6815780Z #define _PSTL_PRAGMA_SIMD_EARLYEXIT 2025-05-07T20:27:14.6815890Z #define __throw_exception_again throw 2025-05-07T20:27:14.6815987Z #define M_SQRT2 1.41421356237309504880 2025-05-07T20:27:14.6816077Z #define __EXCEPTION_H 1 2025-05-07T20:27:14.6816174Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:27:14.6816277Z #define HUGE_VAL (__builtin_huge_val()) 2025-05-07T20:27:14.6816575Z #define cudaStreamAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:27:14.6816685Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:27:14.6816786Z #define _GLIBCXX_INLINE_VERSION 0 2025-05-07T20:27:14.6816882Z #define _GLIBCXX_USE_INT128 1 2025-05-07T20:27:14.6816986Z #define __cpp_lib_bool_constant 201505 2025-05-07T20:27:14.6817081Z #define PTHREAD_KEYS_MAX 1024 2025-05-07T20:27:14.6817220Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:27:14.6817434Z #define __FSFILCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:14.6817545Z #define _GLIBCXX_DOUBLE_IS_IEEE_BINARY64 1 2025-05-07T20:27:14.6817644Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:27:14.6817749Z #define __cpp_lib_tuples_by_type 201304 2025-05-07T20:27:14.6817844Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:27:14.6817944Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:27:14.6818081Z #define _GLIBCXX_THROW_OR_ABORT(_EXC) (throw (_EXC)) 2025-05-07T20:27:14.6818173Z #define __useconds_t_defined 2025-05-07T20:27:14.6818272Z #define _GLIBCXX_USE_SCHED_YIELD 1 2025-05-07T20:27:14.6818450Z #define __attribute_deprecated__ __attribute__ ((__deprecated__)) 2025-05-07T20:27:14.6818593Z #define __cpp_lib_type_trait_variable_templates 201510L 2025-05-07T20:27:14.6818681Z #define __SSE_MATH__ 1 2025-05-07T20:27:14.6818770Z #define _IO_wint_t wint_t 2025-05-07T20:27:14.6818862Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:27:14.6818955Z #define _GLIBCXX_VERBOSE 1 2025-05-07T20:27:14.6819053Z #define _GLIBCXX_HAVE_ASINF 1 2025-05-07T20:27:14.6819165Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:27:14.6819264Z #define _GLIBCXX_HAVE_ISINFL 1 2025-05-07T20:27:14.6819440Z #define _GLIBCXX_HAVE_ASINL 1 2025-05-07T20:27:14.6819525Z #define __USE_ATFILE 1 2025-05-07T20:27:14.6819626Z #define _POSIX_OPEN_MAX 20 2025-05-07T20:27:14.6819720Z #define 
_POSIX_LOGIN_NAME_MAX 9 2025-05-07T20:27:14.6819809Z #define _GCC_PTRDIFF_T 2025-05-07T20:27:14.6820030Z #define cudaKernelNodeAttributePriority cudaLaunchAttributePriority 2025-05-07T20:27:14.6820126Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:27:14.6820228Z #define _POSIX_THREAD_KEYS_MAX 128 2025-05-07T20:27:14.6820328Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:27:14.6820435Z #define __cpp_lib_array_constexpr 201803L 2025-05-07T20:27:14.6820520Z #define _STDLIB_H 1 2025-05-07T20:27:14.6820655Z #define __exctype(name) extern int name (int) __THROW 2025-05-07T20:27:14.6820749Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:27:14.6820854Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:27:14.6820981Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:14.6821094Z #define __SURFACE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:27:14.6821190Z #define __SM_61_INTRINSICS_H__ 2025-05-07T20:27:14.6821369Z #define _GLIBCXX_PACKAGE_STRING "package-unused version-unused" 2025-05-07T20:27:14.6821524Z #define __isxdigit_l(c,l) __isctype_l((c), _ISxdigit, (l)) 2025-05-07T20:27:14.6821626Z #define __glibcxx_requires_nonempty() 2025-05-07T20:27:14.6821740Z #define w_stopsig __wait_stopped.__w_stopsig 2025-05-07T20:27:14.6821835Z #define __ldiv_t_defined 1 2025-05-07T20:27:14.6822010Z #define __glibcxx_requires_irreflexive_pred(_First,_Last,_Pred) 2025-05-07T20:27:14.6822102Z #define ___int_ptrdiff_t_h 2025-05-07T20:27:14.6822271Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:27:14.6822371Z #define __cudaCDP2EventDestroy 2025-05-07T20:27:14.6822461Z #define __HOST_DEFINES_H__ 2025-05-07T20:27:14.6822571Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:27:14.6822671Z #define __SM_20_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:14.6822770Z #define _GLIBCXX_USE_NANOSLEEP 1 2025-05-07T20:27:14.6822857Z #define CUDART_CB 2025-05-07T20:27:14.6822957Z #define BC_BASE_MAX _POSIX2_BC_BASE_MAX 2025-05-07T20:27:14.6823080Z #define _GLIBCXX_USE_C99_INTTYPES_WCHAR_T_TR1 1 2025-05-07T20:27:14.6823165Z #define MB_LEN_MAX 16 2025-05-07T20:27:14.6823383Z #define __glibcxx_requires_partitioned_lower_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:27:14.6823483Z #define _GLIBCXX11_USE_C99_WCHAR 1 2025-05-07T20:27:14.6823604Z #define _IO_peekc(_fp) _IO_peekc_unlocked (_fp) 2025-05-07T20:27:14.6823714Z #define _GLIBCXX_HAVE_AS_SYMVER_DIRECTIVE 1 2025-05-07T20:27:14.6823818Z #define _GLIBCXX_HAVE_UNISTD_H 1 2025-05-07T20:27:14.6823965Z #define __glibc_likely(cond) __builtin_expect((cond), 1) 2025-05-07T20:27:14.6824095Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:27:14.6824190Z #define _GNU_SOURCE 1 2025-05-07T20:27:14.6824375Z #define __stub_putmsg 2025-05-07T20:27:14.6824462Z #define __CUDACC__ 1 2025-05-07T20:27:14.6824552Z #define __N(msgid) (msgid) 2025-05-07T20:27:14.6824641Z #define __P(args) args 2025-05-07T20:27:14.6824891Z #define cudaKernelNodeAttributeCooperative cudaLaunchAttributeCooperative 2025-05-07T20:27:14.6824992Z #define __cpp_init_captures 201304L 2025-05-07T20:27:14.6825095Z #define _GLIBCXX17_CONSTEXPR constexpr 2025-05-07T20:27:14.6825189Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:27:14.6825287Z #define __cpp_lib_as_const 201510 2025-05-07T20:27:14.6825368Z #define __WCHAR_T 2025-05-07T20:27:14.6825461Z #define __ATOMIC_RELEASE 3 2025-05-07T20:27:14.6825553Z #define __fsblkcnt_t_defined 2025-05-07T20:27:14.6825669Z #define __cudaCDP2EventCreateWithFlags 2025-05-07T20:27:14.6825771Z #define __DEVICE_DOUBLE_FUNCTIONS_H__ 
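A dump like the one summarized above is what you get by asking the preprocessor to print its predefined macros. The exact command used by this job lives in .github/scripts/setup_env.bash, so the following is only a hedged sketch of how such a dump is typically produced, not the CI script's actual invocation:

    # Sketch only; the real command is in .github/scripts/setup_env.bash.
    # Host-compiler (gcc) predefined macros:
    echo | gcc -dM -E -x c++ -
    # Routing through nvcc and passing -dM to the host compiler should also pick up
    # the CUDA-specific macros (__NVCC__, __CUDACC__, CUDART_VERSION, ...); the
    # exact flags here are an assumption and may differ from what the script uses:
    nvcc -E -x cu -Xcompiler -dM /dev/null

Dumping the macros is useful for debugging toolchain mismatches, since values like __CUDA_ARCH_LIST__ and CUDART_VERSION record exactly which CUDA toolkit the build sees.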
2025-05-07T20:27:14.6957832Z + conda run -n build_binary nvcc --version
2025-05-07T20:27:16.5924670Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:27:16.5925074Z Copyright (c) 2005-2025 NVIDIA Corporation
2025-05-07T20:27:16.5925663Z Built on Wed_Jan_15_19:20:09_PST_2025
2025-05-07T20:27:16.5925973Z Cuda compilation tools, release 12.8, V12.8.61
2025-05-07T20:27:16.5926306Z Build cuda_12.8.r12.8/compiler.35404655_0
2025-05-07T20:27:16.6549536Z /usr/bin/nvidia-smi
2025-05-07T20:27:16.6554521Z + nvidia-smi
2025-05-07T20:27:16.6733840Z Wed May 7 20:27:16 2025
2025-05-07T20:27:16.6734242Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:16.6734740Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:27:16.6735238Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:16.6735810Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:27:16.6736338Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:27:16.6736783Z |                                         |                        |               MIG M. |
2025-05-07T20:27:16.6737124Z |=========================================+========================+======================|
2025-05-07T20:27:16.6904065Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:27:16.6904510Z |  0%   26C    P8             16W / 300W  |       0MiB / 23028MiB  |      0%      Default |
2025-05-07T20:27:16.6904902Z |                                         |                        |                  N/A |
2025-05-07T20:27:16.6905293Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:16.6907771Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:16.6908197Z | Processes:                                                                              |
2025-05-07T20:27:16.6908634Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:27:16.6909050Z |        ID   ID                                                               Usage      |
2025-05-07T20:27:16.6909390Z |=========================================================================================|
2025-05-07T20:27:16.6911459Z |  No running processes found                                                             |
2025-05-07T20:27:16.6911988Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:16.9314458Z [INSTALL] Successfully installed CUDA 12.8.0
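The two commands above are the toolkit/driver sanity check for this stage: nvcc reports the installed toolkit release, and nvidia-smi reports the driver and the highest CUDA version it supports. For local debugging this can be scripted; a minimal sketch, where only the build_binary env name is taken from the log and everything else is illustrative:

    # Hypothetical post-install check: toolkit release vs. driver-supported CUDA.
    toolkit=$(conda run -n build_binary nvcc --version | sed -n 's/.*release \([0-9.]*\),.*/\1/p')
    driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1)
    echo "CUDA toolkit release: ${toolkit}, NVIDIA driver: ${driver}"
    # A toolkit newer than what the driver supports would only fail at runtime,
    # which is presumably why this job also sets ENFORCE_CUDA_DEVICE=1.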
2025-05-07T20:27:16.9372264Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:27:16.9372899Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:27:16.9386667Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:27:16.9387032Z env:
2025-05-07T20:27:16.9387268Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:27:16.9387569Z   BUILD_ENV: build_binary
2025-05-07T20:27:16.9387824Z   BUILD_TARGET: genai
2025-05-07T20:27:16.9388053Z   BUILD_VARIANT: cuda
2025-05-07T20:27:16.9388282Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:27:16.9388539Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:27:16.9388841Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:27:16.9389179Z ##[endgroup]
2025-05-07T20:27:17.2773521Z ################################################################################
2025-05-07T20:27:17.2773896Z # Install PyTorch (PIP)
2025-05-07T20:27:17.2774137Z #
2025-05-07T20:27:17.2788763Z # [2025-05-07T20:27:17.278Z] + install_pytorch_pip build_binary nightly cuda/12.8.0
2025-05-07T20:27:17.2789217Z ################################################################################
2025-05-07T20:27:17.2817234Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:27:18.2747806Z Channels:
2025-05-07T20:27:18.2748265Z  - conda-forge
2025-05-07T20:27:18.2748714Z Platform: linux-64
2025-05-07T20:27:21.5295655Z Collecting package metadata (repodata.json): done
2025-05-07T20:27:22.2489187Z Solving environment: done
2025-05-07T20:27:22.4693113Z ## Package Plan ##
2025-05-07T20:27:22.4693501Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:27:22.4693895Z   added / updated specs:
2025-05-07T20:27:22.4694139Z     - numpy
2025-05-07T20:27:22.4694416Z The following packages will be downloaded:
2025-05-07T20:27:22.4694749Z     package                    |            build
2025-05-07T20:27:22.4695065Z     ---------------------------|-----------------
2025-05-07T20:27:22.4695460Z     libblas-3.9.0              |31_h59b9bed_openblas          16 KB  conda-forge
2025-05-07T20:27:22.4695917Z     libcblas-3.9.0             |31_he106b2a_openblas          16 KB  conda-forge
2025-05-07T20:27:22.4696360Z     libgfortran-15.1.0         |       h69a702a_2             34 KB  conda-forge
2025-05-07T20:27:22.4696809Z     libgfortran5-15.1.0        |       hcea5267_2            1.5 MB  conda-forge
2025-05-07T20:27:22.4697264Z     liblapack-3.9.0            |31_h7ac8fdf_openblas          16 KB  conda-forge
2025-05-07T20:27:22.4697731Z     libopenblas-0.3.29         |pthreads_h94d23a6_0          5.6 MB  conda-forge
2025-05-07T20:27:22.4698171Z     numpy-2.2.5                |   py312h72c5963_0           8.1 MB  conda-forge
2025-05-07T20:27:22.4698556Z     ------------------------------------------------------------
2025-05-07T20:27:22.4698896Z                                            Total:        15.4 MB
2025-05-07T20:27:22.4699232Z The following NEW packages will be INSTALLED:
2025-05-07T20:27:22.4699675Z   libblas            conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:27:22.4700171Z   libcblas           conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:27:22.4700667Z   libgfortran        conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:27:22.4701155Z   libgfortran5       conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:27:22.4701667Z   liblapack          conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:27:22.4702199Z   libopenblas        conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:27:22.4703088Z   numpy              conda-forge/linux-64::numpy-2.2.5-py312h72c5963_0
2025-05-07T20:27:22.4703514Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:27:23.5793605Z Preparing transaction: done
2025-05-07T20:27:23.6799086Z Verifying transaction: done
2025-05-07T20:27:23.7807762Z Executing transaction: done
2025-05-07T20:27:23.9540862Z ################################################################################
2025-05-07T20:27:23.9541301Z # Install Package From PyTorch PIP: torch
2025-05-07T20:27:23.9541601Z #
2025-05-07T20:27:23.9556497Z # [2025-05-07T20:27:23.955Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.8.0
2025-05-07T20:27:23.9556972Z ################################################################################
2025-05-07T20:27:23.9576275Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:27:24.0484913Z [CHECK] Network does not appear to be blocked.
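Every "[EXEC] [ATTEMPT 0/3]" line in this log comes from a retry wrapper in the setup scripts, which runs the wrapped command up to three times before failing the step. The real helper lives in .github/scripts/setup_env.bash; the bash below is only a minimal sketch of the pattern the log implies (the function name and backoff are assumptions):

    # Hypothetical sketch of the retry pattern behind the "[EXEC] [ATTEMPT x/3]" lines.
    exec_with_retries () {
      local max_attempts=3
      local attempt
      for attempt in $(seq 0 $((max_attempts - 1))); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
        if "$@"; then
          return 0          # command succeeded; stop retrying
        fi
        sleep $((2 ** attempt))   # simple exponential backoff between attempts
      done
      echo "[EXEC] Command failed after ${max_attempts} attempts: $*" >&2
      return 1
    }

    # Example usage, matching the network probe in the log:
    #   exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null

Retrying the network probe and the package installs makes the job resilient to transient index or mirror hiccups, which are the most common cause of flaky CI installs.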
2025-05-07T20:27:24.0485419Z ################################################################################
2025-05-07T20:27:24.0485792Z # Prepare PIP Arguments (PyTorch PIP)
2025-05-07T20:27:24.0486073Z #
2025-05-07T20:27:24.0503939Z # [2025-05-07T20:27:24.050Z] + __prepare_pip_arguments torch nightly cuda/12.8.0
2025-05-07T20:27:24.0504390Z ################################################################################
2025-05-07T20:27:24.0528082Z [INSTALL] Extracted package (channel, version): (nightly, LATEST)
2025-05-07T20:27:24.0553303Z [INSTALL] Extracted package variant: cu128
2025-05-07T20:27:24.0569494Z [INSTALL] Using a non-RELEASE channel: nightly ...
2025-05-07T20:27:24.0570038Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu128/
2025-05-07T20:27:24.0578274Z [INSTALL] Extracted the full PIP package: --pre torch
2025-05-07T20:27:24.0586446Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu128/ ...
2025-05-07T20:27:24.0607377Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128/
2025-05-07T20:29:02.0721830Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu128/
2025-05-07T20:29:02.0722452Z Collecting torch
2025-05-07T20:29:02.0723317Z   Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (30 kB)
2025-05-07T20:29:02.0724656Z Collecting filelock (from torch)
2025-05-07T20:29:02.0725329Z   Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB)
2025-05-07T20:29:02.0726341Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from torch) (4.13.2)
2025-05-07T20:29:02.0727401Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from torch) (78.1.1)
2025-05-07T20:29:02.0728065Z Collecting sympy>=1.13.3 (from torch)
2025-05-07T20:29:02.0728559Z   Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB)
2025-05-07T20:29:02.0729752Z Collecting networkx (from torch)
2025-05-07T20:29:02.0730251Z   Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB)
2025-05-07T20:29:02.0731253Z Collecting jinja2 (from torch)
2025-05-07T20:29:02.0731731Z   Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB)
2025-05-07T20:29:02.0732249Z Collecting fsspec (from torch)
2025-05-07T20:29:02.0732731Z   Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB)
2025-05-07T20:29:02.0733303Z Collecting nvidia-cuda-nvrtc-cu12==12.8.61 (from torch)
2025-05-07T20:29:02.0741614Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:29:02.0742489Z Collecting nvidia-cuda-runtime-cu12==12.8.57 (from torch)
2025-05-07T20:29:02.0743323Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:29:02.0744151Z Collecting nvidia-cuda-cupti-cu12==12.8.57 (from torch)
2025-05-07T20:29:02.0744947Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:29:02.0745731Z Collecting nvidia-cudnn-cu12==9.8.0.87 (from torch)
2025-05-07T20:29:02.0746421Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl.metadata (1.8 kB)
2025-05-07T20:29:02.0747110Z Collecting nvidia-cublas-cu12==12.8.3.14 (from torch)
2025-05-07T20:29:02.0748043Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:29:02.0748748Z Collecting nvidia-cufft-cu12==11.3.3.41 (from torch)
2025-05-07T20:29:02.0749524Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
2025-05-07T20:29:02.0750304Z Collecting nvidia-curand-cu12==10.3.9.55 (from torch)
2025-05-07T20:29:02.0751006Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.5 kB)
2025-05-07T20:29:02.0751720Z Collecting nvidia-cusolver-cu12==11.7.2.55 (from torch)
2025-05-07T20:29:02.0752439Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:29:02.0753152Z Collecting nvidia-cusparse-cu12==12.5.7.53 (from torch)
2025-05-07T20:29:02.0753959Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:29:02.0754760Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch)
2025-05-07T20:29:02.0755478Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB)
2025-05-07T20:29:02.0756388Z Collecting nvidia-nccl-cu12==2.26.2 (from torch)
2025-05-07T20:29:02.0757150Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB)
2025-05-07T20:29:02.0757914Z Collecting nvidia-nvtx-cu12==12.8.55 (from torch)
2025-05-07T20:29:02.0758679Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:29:02.0759458Z Collecting nvidia-nvjitlink-cu12==12.8.61 (from torch)
2025-05-07T20:29:02.0760307Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:29:02.0761127Z Collecting nvidia-cufile-cu12==1.13.0.11 (from torch)
2025-05-07T20:29:02.0761911Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
2025-05-07T20:29:02.0762717Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch)
2025-05-07T20:29:02.0763549Z   Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:29:02.0764366Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch)
2025-05-07T20:29:02.0764918Z   Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB)
2025-05-07T20:29:02.0766637Z Collecting MarkupSafe>=2.0 (from jinja2->torch)
2025-05-07T20:29:02.0767424Z   Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28 kB)
2025-05-07T20:29:02.0768467Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp312-cp312-manylinux_2_28_x86_64.whl (1047.0 MB)
2025-05-07T20:29:02.0769981Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl (609.6 MB)
2025-05-07T20:29:02.0771574Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (10.2 MB)
2025-05-07T20:29:02.0773373Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (88.0 MB)
2025-05-07T20:29:02.0775012Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (954 kB)
2025-05-07T20:29:02.0776544Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl (698.0 MB)
2025-05-07T20:29:02.0778071Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (193.1 MB)
2025-05-07T20:29:02.0779669Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.2 MB)
2025-05-07T20:29:02.0781313Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl (63.6 MB)
2025-05-07T20:29:02.0782764Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl (260.4 MB)
2025-05-07T20:29:02.0784311Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (292.1 MB)
2025-05-07T20:29:02.0785865Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB)
2025-05-07T20:29:02.0787375Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB)
2025-05-07T20:29:02.0789077Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.2 MB)
2025-05-07T20:29:02.0790966Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB)
2025-05-07T20:29:02.0792115Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB)
2025-05-07T20:29:02.0794711Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
2025-05-07T20:29:02.0798336Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.8.3.14 nvidia-cuda-cupti-cu12-12.8.57 nvidia-cuda-nvrtc-cu12-12.8.61 nvidia-cuda-runtime-cu12-12.8.57 nvidia-cudnn-cu12-9.8.0.87 nvidia-cufft-cu12-11.3.3.41 nvidia-cufile-cu12-1.13.0.11 nvidia-curand-cu12-10.3.9.55 nvidia-cusolver-cu12-11.7.2.55 nvidia-cusparse-cu12-12.5.7.53 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.8.61 nvidia-nvtx-cu12-12.8.55 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu128
2025-05-07T20:29:04.3030757Z torch 2.8.0.dev20250507+cu128
2025-05-07T20:29:04.3033543Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu128)
2025-05-07T20:29:07.7964541Z [CHECK] Python (sub-)package 'torch.distributed' found ...
2025-05-07T20:29:11.3195026Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu128
2025-05-07T20:29:11.3195470Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ...
2025-05-07T20:29:14.7747804Z True
2025-05-07T20:29:14.7748053Z True
2025-05-07T20:29:14.8394085Z [INSTALL] Successfully installed PyTorch through PyTorch PIP
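The post-install checks above (correct cu128 variant, torch.distributed importable, C++11 ABI flag) can be reproduced by hand against the same environment. A minimal sketch, assuming the build_binary conda env from the log; these one-liners are illustrative, not the exact commands from setup_env.bash:

    # Hypothetical reproduction of the post-install checks; the exact logic
    # lives in .github/scripts/setup_env.bash.
    conda run -n build_binary python -c "
    import torch, torch.distributed
    # The nightly cu128 wheel encodes the CUDA variant in the version string.
    assert torch.__version__.endswith('+cu128'), torch.__version__
    print(torch.version.cuda)                  # CUDA version torch was built with
    print(torch.compiled_with_cxx11_abi())     # _GLIBCXX_USE_CXX11_ABI setting
    "

Checking the ABI flag here matters because FBGEMM's C++ extensions must be compiled against the same _GLIBCXX_USE_CXX11_ABI value as the installed torch wheel, or the build will fail to link.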
2025-05-07T20:29:14.8433613Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:29:14.8434230Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:29:14.8448016Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:29:14.8448370Z env:
2025-05-07T20:29:14.8448608Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:29:14.8448912Z   BUILD_ENV: build_binary
2025-05-07T20:29:14.8449166Z   BUILD_TARGET: genai
2025-05-07T20:29:14.8449403Z   BUILD_VARIANT: cuda
2025-05-07T20:29:14.8449653Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:29:14.8449913Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:29:14.8450224Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:29:14.8450567Z ##[endgroup]
2025-05-07T20:29:15.1821931Z /home/ec2-user/miniconda/bin/conda
2025-05-07T20:29:15.1824783Z ################################################################################
2025-05-07T20:29:15.1825923Z # Collect PyTorch Environment Information (for Reporting Issues)
2025-05-07T20:29:15.1826651Z #
2025-05-07T20:29:15.1840091Z # [2025-05-07T20:29:15.183Z] + collect_pytorch_env_info build_binary
2025-05-07T20:29:15.1840589Z ################################################################################
2025-05-07T20:29:15.1855521Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:29:15.2792443Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:29:15.2800825Z [INFO] Downloading the PyTorch environment info collection script ...
2025-05-07T20:29:15.2801670Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
2025-05-07T20:29:15.3696434Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ...
2025-05-07T20:29:15.3721071Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py
2025-05-07T20:29:21.2598401Z Collecting environment information...
2025-05-07T20:29:21.2598831Z PyTorch version: 2.8.0.dev20250507+cu128
2025-05-07T20:29:21.2599130Z Is debug build: False
2025-05-07T20:29:21.2599377Z CUDA used to build PyTorch: 12.8
2025-05-07T20:29:21.2599656Z ROCM used to build PyTorch: N/A
2025-05-07T20:29:21.2599941Z OS: Amazon Linux 2023.6.20250317 (x86_64)
2025-05-07T20:29:21.2600263Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:29:21.2600574Z Clang version: Could not collect
2025-05-07T20:29:21.2600851Z CMake version: Could not collect
2025-05-07T20:29:21.2601121Z Libc version: glibc-2.34
2025-05-07T20:29:21.2601576Z Python version: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0] (64-bit runtime)
2025-05-07T20:29:21.2602185Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34
2025-05-07T20:29:21.2602987Z Is CUDA available: True
2025-05-07T20:29:21.2603236Z CUDA runtime version: 12.8.61
2025-05-07T20:29:21.2603511Z CUDA_MODULE_LOADING set to: LAZY
2025-05-07T20:29:21.2603823Z GPU models and configuration: GPU 0: NVIDIA A10G
2025-05-07T20:29:21.2604152Z Nvidia driver version: 570.133.07
2025-05-07T20:29:21.2604429Z cuDNN version: Could not collect
2025-05-07T20:29:21.2604700Z HIP runtime version: N/A
2025-05-07T20:29:21.2604955Z MIOpen runtime version: N/A
2025-05-07T20:29:21.2605210Z Is XNNPACK available: True
2025-05-07T20:29:21.2605458Z CPU:
2025-05-07T20:29:21.2605676Z Architecture:                         x86_64
2025-05-07T20:29:21.2606012Z CPU op-mode(s):                       32-bit, 64-bit
2025-05-07T20:29:21.2606414Z Address sizes:                        48 bits physical, 48 bits virtual
2025-05-07T20:29:21.2606806Z Byte Order:                           Little Endian
2025-05-07T20:29:21.2607117Z CPU(s):                               16
2025-05-07T20:29:21.2607423Z On-line CPU(s) list:                  0-15
2025-05-07T20:29:21.2607956Z Vendor ID:                            AuthenticAMD
2025-05-07T20:29:21.2608301Z Model name:                           AMD EPYC 7R32
2025-05-07T20:29:21.2608614Z CPU family:                           23
2025-05-07T20:29:21.2608903Z Model:                                49
2025-05-07T20:29:21.2609194Z Thread(s) per core:                   2
2025-05-07T20:29:21.2609481Z Core(s) per socket:                   8
2025-05-07T20:29:21.2609767Z Socket(s):                            1
2025-05-07T20:29:21.2610052Z Stepping:                             0
2025-05-07T20:29:21.2610345Z BogoMIPS:                             5599.99
2025-05-07T20:29:21.2612425Z Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:29:21.2614513Z Hypervisor vendor:                    KVM
2025-05-07T20:29:21.2614828Z Virtualization type:                  full
2025-05-07T20:29:21.2615167Z L1d cache:                            256 KiB (8 instances)
2025-05-07T20:29:21.2615527Z L1i cache:                            256 KiB (8 instances)
2025-05-07T20:29:21.2615888Z L2 cache:                             4 MiB (8 instances)
2025-05-07T20:29:21.2616246Z L3 cache:                             32 MiB (2 instances)
2025-05-07T20:29:21.2616567Z NUMA node(s):                         1
2025-05-07T20:29:21.2616857Z NUMA node0 CPU(s):                    0-15
2025-05-07T20:29:21.2617191Z Vulnerability Gather data sampling:   Not affected
2025-05-07T20:29:21.2617572Z Vulnerability Itlb multihit:          Not affected
2025-05-07T20:29:21.2617922Z Vulnerability L1tf:                   Not affected
2025-05-07T20:29:21.2618271Z Vulnerability Mds:                    Not affected
2025-05-07T20:29:21.2618622Z Vulnerability Meltdown:               Not affected
2025-05-07T20:29:21.2618973Z Vulnerability Mmio stale data:        Not affected
2025-05-07T20:29:21.2619336Z Vulnerability Reg file data sampling: Not affected
2025-05-07T20:29:21.2619873Z Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
2025-05-07T20:29:21.2620446Z Vulnerability Spec rstack overflow:   Mitigation; safe RET
2025-05-07T20:29:21.2620977Z Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
2025-05-07T20:29:21.2621657Z Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
2025-05-07T20:29:21.2622515Z Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
2025-05-07T20:29:21.2623272Z Vulnerability Srbds:                  Not affected
2025-05-07T20:29:21.2623635Z Vulnerability Tsx async abort:        Not affected
2025-05-07T20:29:21.2623972Z Versions of relevant libraries:
2025-05-07T20:29:21.2624238Z [pip3] numpy==2.2.5
2025-05-07T20:29:21.2624478Z [pip3] nvidia-cublas-cu12==12.8.3.14
2025-05-07T20:29:21.2624788Z [pip3] nvidia-cuda-cupti-cu12==12.8.57
2025-05-07T20:29:21.2625101Z [pip3] nvidia-cuda-nvrtc-cu12==12.8.61
2025-05-07T20:29:21.2625410Z [pip3] nvidia-cuda-runtime-cu12==12.8.57
2025-05-07T20:29:21.2625727Z [pip3] nvidia-cudnn-cu12==9.8.0.87
2025-05-07T20:29:21.2626021Z [pip3] nvidia-cufft-cu12==11.3.3.41
2025-05-07T20:29:21.2626309Z [pip3] nvidia-curand-cu12==10.3.9.55
2025-05-07T20:29:21.2626614Z [pip3] nvidia-cusolver-cu12==11.7.2.55
2025-05-07T20:29:21.2626918Z [pip3] nvidia-cusparse-cu12==12.5.7.53
2025-05-07T20:29:21.2627342Z [pip3] nvidia-cusparselt-cu12==0.6.3
2025-05-07T20:29:21.2627637Z [pip3] nvidia-nccl-cu12==2.26.2
2025-05-07T20:29:21.2627926Z [pip3] nvidia-nvjitlink-cu12==12.8.61
2025-05-07T20:29:21.2628224Z [pip3] nvidia-nvtx-cu12==12.8.55
2025-05-07T20:29:21.2628507Z [pip3] pytorch-triton==3.3.0+git96316ce5
2025-05-07T20:29:21.2628812Z [pip3] torch==2.8.0.dev20250507+cu128
2025-05-07T20:29:21.2629184Z [conda] cuda-cudart                 12.8.57             h5888daf_1    conda-forge
2025-05-07T20:29:21.2629661Z [conda] cuda-cudart-dev             12.8.57             h5888daf_1    conda-forge
2025-05-07T20:29:21.2630169Z [conda] cuda-cudart-dev_linux-64    12.8.57             h3f2d84a_1    conda-forge
2025-05-07T20:29:21.2630688Z [conda] cuda-cudart-static          12.8.57             h5888daf_1    conda-forge
2025-05-07T20:29:21.2631218Z [conda] cuda-cudart-static_linux-64 12.8.57             h3f2d84a_1    conda-forge
2025-05-07T20:29:21.2631739Z [conda] cuda-cudart_linux-64        12.8.57             h3f2d84a_1    conda-forge
2025-05-07T20:29:21.2632229Z [conda] cuda-cupti                  12.8.57             hbd13f7d_0    conda-forge
2025-05-07T20:29:21.2632697Z [conda] cuda-cupti-dev              12.8.57             h5888daf_0    conda-forge
2025-05-07T20:29:21.2633173Z [conda] cuda-libraries              12.8.0              ha770c72_0    conda-forge
2025-05-07T20:29:21.2633671Z [conda] cuda-libraries-dev          12.8.0              ha770c72_0    conda-forge
2025-05-07T20:29:21.2634151Z [conda] cuda-nvrtc                  12.8.61             hbd13f7d_0    conda-forge
2025-05-07T20:29:21.2634617Z [conda] cuda-nvrtc-dev              12.8.61             h5888daf_0    conda-forge
2025-05-07T20:29:21.2635069Z [conda] cuda-nvtx                   12.8.55             hbd13f7d_0    conda-forge
2025-05-07T20:29:21.2635629Z [conda] cuda-opencl                 12.8.55             hbd13f7d_0    conda-forge
2025-05-07T20:29:21.2636109Z [conda] cuda-opencl-dev             12.8.55             h5888daf_0    conda-forge
2025-05-07T20:29:21.2636589Z [conda] cuda-runtime                12.8.0              ha804496_0    conda-forge
2025-05-07T20:29:21.2637051Z [conda] libcublas                   12.8.3.14           h9ab20c4_0    conda-forge
2025-05-07T20:29:21.2637519Z [conda] libcublas-dev               12.8.3.14           h9ab20c4_0    conda-forge
2025-05-07T20:29:21.2637983Z [conda] libcufft                    11.3.3.41           hbd13f7d_0    conda-forge
2025-05-07T20:29:21.2638435Z [conda] libcufft-dev                11.3.3.41           h5888daf_0    conda-forge
2025-05-07T20:29:21.2638899Z [conda] libcurand                   10.3.9.55           hbd13f7d_0    conda-forge
2025-05-07T20:29:21.2639368Z [conda] libcurand-dev               10.3.9.55           h5888daf_0    conda-forge
2025-05-07T20:29:21.2639842Z [conda] libcusolver                 11.7.2.55           h9ab20c4_0    conda-forge
2025-05-07T20:29:21.2640315Z [conda] libcusolver-dev             11.7.2.55           h9ab20c4_0    conda-forge
2025-05-07T20:29:21.2640896Z [conda] libcusparse                 12.5.7.53           hbd13f7d_0    conda-forge
2025-05-07T20:29:21.2641376Z [conda] libcusparse-dev             12.5.7.53           h5888daf_0    conda-forge
2025-05-07T20:29:21.2641855Z [conda] libnvjitlink                12.8.61             hbd13f7d_0    conda-forge
2025-05-07T20:29:21.2642341Z [conda] libnvjitlink-dev            12.8.61             h5888daf_0    conda-forge
2025-05-07T20:29:21.2642804Z [conda] numpy                       2.2.5               py312h72c5963_0 conda-forge
2025-05-07T20:29:21.2643271Z [conda] nvidia-cublas-cu12          12.8.3.14           pypi_0    pypi
2025-05-07T20:29:21.2643762Z [conda] nvidia-cuda-cupti-cu12      12.8.57             pypi_0    pypi
2025-05-07T20:29:21.2644262Z [conda] nvidia-cuda-nvrtc-cu12      12.8.61             pypi_0    pypi
2025-05-07T20:29:21.2644766Z [conda] nvidia-cuda-runtime-cu12    12.8.57             pypi_0    pypi
2025-05-07T20:29:21.2645249Z [conda] nvidia-cudnn-cu12           9.8.0.87            pypi_0    pypi
2025-05-07T20:29:21.2645821Z [conda] nvidia-cufft-cu12           11.3.3.41           pypi_0    pypi
2025-05-07T20:29:21.2646301Z [conda] nvidia-curand-cu12          10.3.9.55           pypi_0    pypi
2025-05-07T20:29:21.2646789Z [conda] nvidia-cusolver-cu12        11.7.2.55           pypi_0    pypi
2025-05-07T20:29:21.2647276Z [conda] nvidia-cusparse-cu12        12.5.7.53           pypi_0    pypi
2025-05-07T20:29:21.2647778Z [conda] nvidia-cusparselt-cu12      0.6.3               pypi_0    pypi
2025-05-07T20:29:21.2648270Z [conda] nvidia-nccl-cu12            2.26.2              pypi_0    pypi
2025-05-07T20:29:21.2648749Z [conda] nvidia-nvjitlink-cu12       12.8.61             pypi_0    pypi
2025-05-07T20:29:21.2649235Z [conda] nvidia-nvtx-cu12            12.8.55             pypi_0    pypi
2025-05-07T20:29:21.2649733Z [conda] pytorch-triton              3.3.0+git96316ce5   pypi_0    pypi
2025-05-07T20:29:21.2650208Z [conda] torch                       2.8.0.dev20250507+cu128 pypi_0    pypi
2025-05-07T20:29:21.3400254Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV
2025-05-07T20:29:21.3400808Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:29:21.3412635Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:21.3412982Z env: 2025-05-07T20:29:21.3413215Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:21.3413515Z BUILD_ENV: build_binary 2025-05-07T20:29:21.3413766Z BUILD_TARGET: genai 2025-05-07T20:29:21.3413996Z BUILD_VARIANT: cuda 2025-05-07T20:29:21.3414236Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:29:21.3414491Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:21.3414792Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:21.3415129Z ##[endgroup] 2025-05-07T20:29:21.6803624Z ################################################################################ 2025-05-07T20:29:21.6804020Z # Prepare FBGEMM-GPU Build 2025-05-07T20:29:21.6804306Z # 2025-05-07T20:29:21.6818899Z # [2025-05-07T20:29:21.681Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:29:21.6819311Z ################################################################################ 2025-05-07T20:29:21.6819527Z 2025-05-07T20:29:21.6834147Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:21.7810171Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:21.7831320Z [BUILD] Running git submodules update ... 2025-05-07T20:29:21.7852283Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:29:21.8209361Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:29:21.8209990Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:29:21.8210443Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:29:21.8210843Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:29:21.8211247Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:29:21.8212029Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:29:21.8212444Z Synchronizing submodule url for '../external/json' 2025-05-07T20:29:21.8243646Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:29:21.8797307Z [BUILD] Installing other build dependencies ... 
2025-05-07T20:29:21.8819115Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:29:24.2861390Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:29:24.3032401Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:29:24.4047417Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:29:24.4077260Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:29:24.6069937Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:29:24.6099450Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:29:24.7153253Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:29:24.7177964Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:29:25.0395017Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:29:25.0422906Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:29:25.0901529Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:29:25.0905255Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:29:25.1594806Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:29:25.1619549Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:29:25.2119814Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:29:25.2559785Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:29:25.2585652Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:29:25.3748109Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:29:25.3810375Z Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:29:25.5023156Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:29:25.5125141Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:29:25.5662088Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:29:25.6273834Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:29:25.6310350Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:29:25.7286193Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:29:25.7312872Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:29:25.8415616Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:29:25.8446737Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:29:25.9460814Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:29:25.9501087Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:29:26.0423926Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:29:26.0455044Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:29:26.1471275Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:29:26.1498520Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:29:26.2513362Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:29:26.2536359Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:29:26.3022318Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:29:26.3540133Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:29:26.3563445Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:29:26.4036138Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:29:26.4518598Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:29:26.4674713Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:29:26.5145474Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:29:26.5778028Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:29:26.5802719Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:29:26.6278627Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:29:26.6761236Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:29:26.7237662Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:29:27.2662031Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 51.4 MB/s eta 0:00:00 2025-05-07T20:29:27.2687815Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:29:27.3160544Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:29:27.3737778Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:29:27.4235928Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:29:27.4745646Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:29:27.5196128Z Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (767 kB) 2025-05-07T20:29:27.5817059Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 767.5/767.5 kB 8.4 MB/s eta 0:00:00 2025-05-07T20:29:27.5845444Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:29:27.6344224Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:29:27.6821457Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:29:27.7308408Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:29:27.7851200Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:29:27.8332565Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:29:27.8912682Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:29:27.9448869Z Downloading 
pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:29:27.9929518Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:29:28.0341625Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:29:28.2000233Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:29:30.4778742Z 2025-05-07T20:29:30.4825513Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:29:30.6462134Z ################################################################################ 2025-05-07T20:29:30.6462579Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:29:30.6462934Z # 2025-05-07T20:29:30.6481478Z # [2025-05-07T20:29:30.647Z] + install_triton_pip build_binary 2025-05-07T20:29:30.6481856Z ################################################################################ 2025-05-07T20:29:30.6482071Z 2025-05-07T20:29:30.6482290Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:29:30.6482753Z ################################################################################ 2025-05-07T20:29:30.6483215Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:29:30.6483538Z # 2025-05-07T20:29:30.6499232Z # [2025-05-07T20:29:30.649Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:29:30.6499797Z ################################################################################ 2025-05-07T20:29:30.6500009Z 2025-05-07T20:29:30.6515534Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:30.7413072Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:30.7413547Z ################################################################################ 2025-05-07T20:29:30.7413890Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:29:30.7414167Z # 2025-05-07T20:29:30.7431295Z # [2025-05-07T20:29:30.742Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:29:30.7431862Z ################################################################################ 2025-05-07T20:29:30.7432083Z 2025-05-07T20:29:30.7478951Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:29:30.7496237Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:29:30.7496980Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:30.7505353Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:30.7515200Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:29:30.7537462Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:38.4429363Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts. 2025-05-07T20:29:38.4430592Z torch 2.8.0.dev20250507+cu128 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:29:38.4431249Z 2025-05-07T20:29:38.4431471Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:38.4431880Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:38.4432670Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:29:38.4433876Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:29:38.4434945Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 56.1 MB/s eta 0:00:00 2025-05-07T20:29:38.4435329Z Installing collected packages: pytorch-triton 2025-05-07T20:29:38.4435785Z Attempting uninstall: pytorch-triton 2025-05-07T20:29:38.4436169Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:29:38.4436583Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:29:38.4437378Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:29:38.4437814Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:29:38.4438069Z 2025-05-07T20:29:40.6484993Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:29:40.6488427Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:29:42.7953129Z ################################################################################ 2025-05-07T20:29:42.7953729Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:29:42.7954252Z ################################################################################ 2025-05-07T20:29:42.7954558Z 2025-05-07T20:29:44.8317417Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:29:46.9948276Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:29:46.9952622Z [BUILD] Successfully ran git submodules update 2025-05-07T20:29:46.9984919Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:46.9985409Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:46.9998073Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:46.9998426Z env: 2025-05-07T20:29:46.9998653Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:46.9998953Z BUILD_ENV: build_binary 2025-05-07T20:29:46.9999207Z BUILD_TARGET: genai 2025-05-07T20:29:46.9999441Z BUILD_VARIANT: cuda 2025-05-07T20:29:46.9999723Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:29:46.9999983Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:47.0000284Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:47.0000621Z ##[endgroup] 2025-05-07T20:29:47.3355282Z ################################################################################ 2025-05-07T20:29:47.3355767Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:29:47.3356118Z # 2025-05-07T20:29:47.3371740Z # [2025-05-07T20:29:47.336Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:47.3372615Z ################################################################################ 2025-05-07T20:29:47.3372835Z 2025-05-07T20:29:47.3373193Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:47.3373884Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:47.3374226Z 2025-05-07T20:29:47.3536564Z c73a702bbc09a0f1f522be4fc10889dc19360f75 fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:47.3538084Z 2025-05-07T20:29:47.3538773Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:47.3539137Z 2025-05-07T20:29:47.3726716Z 3a160ecc54665559cce7e57cc15438640cf521df66903a79480f30a5b3cf6942 fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:47.3729189Z 2025-05-07T20:29:47.3729687Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:47.3730026Z 2025-05-07T20:29:47.4061663Z e7438d9eb3f38b23c683d9c8a7a66fd4 fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:47.4063338Z 2025-05-07T20:29:47.4073481Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl ... 2025-05-07T20:29:47.4094815Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:50.1846593Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:50.1847551Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:29:50.1848407Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:50.1849199Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:50.1849468Z 2025-05-07T20:29:57.1939515Z ################################################################################ 2025-05-07T20:29:57.1940427Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:29:57.1941482Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu128 2025-05-07T20:29:57.1942515Z [CHECK] CUDA version reported by PyTorch is: 12.8 2025-05-07T20:29:57.1943140Z [CHECK] 2025-05-07T20:29:57.1943783Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:29:57.1944766Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:29:57.1945544Z ################################################################################ 2025-05-07T20:29:57.1945981Z 2025-05-07T20:29:57.1946212Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:30:01.2029747Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:30:05.1931363Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:30:09.1852816Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:30:09.1856190Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:30:21.1166966Z ################################################################################ 2025-05-07T20:30:21.1167403Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:30:21.1167762Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:30:21.1168118Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:30:21.1168473Z ################################################################################ 2025-05-07T20:30:21.1168690Z 2025-05-07T20:30:29.1024659Z ################################################################################ 2025-05-07T20:30:29.1025173Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:30:29.1026591Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:30:29.1028169Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:30:29.1028707Z ################################################################################ 2025-05-07T20:30:29.1028932Z 2025-05-07T20:30:29.1029089Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:30:33.0973832Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:30:37.0906529Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:30:41.1916830Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:30:45.2078227Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:30:45.2081671Z [INSTALL] Check for operator registrations ... 
2025-05-07T20:30:49.1118335Z fbgemm.nccl_init 2025-05-07T20:30:49.1118520Z 2025-05-07T20:30:49.1736971Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:30:53.0670027Z fbgemm.gqa_attn_splitk 2025-05-07T20:30:53.0670236Z 2025-05-07T20:30:53.1279533Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:30:57.0333660Z fbgemm.rope_qkv_decoding 2025-05-07T20:30:57.0333863Z 2025-05-07T20:30:57.0953054Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:30:57.0953635Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:30:57.0989970Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:57.0990434Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:57.1003898Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:30:57.1004276Z env: 2025-05-07T20:30:57.1004595Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:30:57.1004916Z BUILD_ENV: build_binary 2025-05-07T20:30:57.1005178Z BUILD_TARGET: genai 2025-05-07T20:30:57.1005415Z BUILD_VARIANT: cuda 2025-05-07T20:30:57.1005652Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:30:57.1005919Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:30:57.1006229Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:30:57.1006561Z ##[endgroup] 2025-05-07T20:30:57.4369232Z ################################################################################ 2025-05-07T20:30:57.4369671Z # Test All FBGEMM-GPU Modules 2025-05-07T20:30:57.4369935Z # 2025-05-07T20:30:57.4384559Z # [2025-05-07T20:30:57.438Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:30:57.4385120Z ################################################################################ 2025-05-07T20:30:57.4385416Z 2025-05-07T20:31:05.4310524Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:31:05.4311100Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:31:05.4311496Z [TEST] Determined the test directories: 2025-05-07T20:31:05.4311811Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:31:05.4312108Z fbgemm_gpu/experimental/example/test 2025-05-07T20:31:05.4312408Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:31:05.4312592Z 2025-05-07T20:31:05.4318786Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:31:05.4325416Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:31:05.4325850Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:31:05.4326140Z 2025-05-07T20:31:05.8538982Z 2025-05-07T20:31:05.8539462Z [TEST] Installing PyTest ... 
2025-05-07T20:31:05.8564647Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:31:06.9649559Z Channels: 2025-05-07T20:31:06.9649879Z - conda-forge 2025-05-07T20:31:06.9650192Z Platform: linux-64 2025-05-07T20:31:10.2443358Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:31:11.4010471Z Solving environment: \ | / done 2025-05-07T20:31:11.6315595Z 2025-05-07T20:31:11.6315872Z ## Package Plan ## 2025-05-07T20:31:11.6316105Z 2025-05-07T20:31:11.6316397Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:31:11.6316746Z 2025-05-07T20:31:11.6316851Z added / updated specs: 2025-05-07T20:31:11.6317094Z - expecttest 2025-05-07T20:31:11.6317313Z - pytest 2025-05-07T20:31:11.6317435Z 2025-05-07T20:31:11.6317439Z 2025-05-07T20:31:11.6317563Z The following packages will be downloaded: 2025-05-07T20:31:11.6317828Z 2025-05-07T20:31:11.6317991Z package | build 2025-05-07T20:31:11.6318447Z ---------------------------|----------------- 2025-05-07T20:31:11.6318886Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:31:11.6319532Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T20:31:11.6320167Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:31:11.6320740Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:31:11.6321176Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T20:31:11.6321593Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:31:11.6322006Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:31:11.6322698Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T20:31:11.6323094Z ------------------------------------------------------------ 2025-05-07T20:31:11.6323427Z Total: 428 KB 2025-05-07T20:31:11.6323787Z 2025-05-07T20:31:11.6323912Z The following NEW packages will be INSTALLED: 2025-05-07T20:31:11.6324128Z 2025-05-07T20:31:11.6324331Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:31:11.6324838Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T20:31:11.6325347Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:31:11.6325831Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:31:11.6326291Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T20:31:11.6326739Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:31:11.6327167Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:31:11.6327588Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T20:31:11.6327842Z 2025-05-07T20:31:11.6327846Z 2025-05-07T20:31:11.6327850Z 2025-05-07T20:31:11.6328003Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:31:11.6328368Z pytest-8.3.5 | 254 KB | | 0% 2025-05-07T20:31:11.6328591Z 2025-05-07T20:31:11.6328858Z packaging-25.0 | 61 KB | | 0%  2025-05-07T20:31:11.6329098Z 2025-05-07T20:31:11.6329107Z 2025-05-07T20:31:11.6342773Z colorama-0.4.6 | 26 KB | | 0%  2025-05-07T20:31:11.6343030Z 2025-05-07T20:31:11.6343034Z 2025-05-07T20:31:11.6343044Z 2025-05-07T20:31:11.6355933Z pluggy-1.5.0 | 23 KB | | 0%  2025-05-07T20:31:11.6356194Z 2025-05-07T20:31:11.6356205Z 2025-05-07T20:31:11.6356209Z 2025-05-07T20:31:11.6357753Z 2025-05-07T20:31:11.6367779Z exceptiongroup-1.2.2 | 20 KB | | 0%  2025-05-07T20:31:11.6368073Z 2025-05-07T20:31:11.6368078Z 2025-05-07T20:31:11.6368088Z 2025-05-07T20:31:11.6368093Z 2025-05-07T20:31:11.6368103Z 2025-05-07T20:31:11.6369085Z tomli-2.2.1 | 19 KB | | 0%  2025-05-07T20:31:11.6369360Z 2025-05-07T20:31:11.6369365Z 2025-05-07T20:31:11.6369368Z 2025-05-07T20:31:11.6369386Z 2025-05-07T20:31:11.6369389Z 2025-05-07T20:31:11.6369393Z 2025-05-07T20:31:11.6372824Z expecttest-0.3.0 | 14 KB | | 0%  2025-05-07T20:31:11.6373109Z 2025-05-07T20:31:11.6373120Z 2025-05-07T20:31:11.6373124Z 2025-05-07T20:31:11.6373128Z 2025-05-07T20:31:11.6373132Z 2025-05-07T20:31:11.6373135Z 2025-05-07T20:31:11.6376088Z 2025-05-07T20:31:11.7088906Z iniconfig-2.0.0 | 11 KB | | 0%  2025-05-07T20:31:11.7089219Z 2025-05-07T20:31:11.7089223Z 2025-05-07T20:31:11.7089894Z 2025-05-07T20:31:11.7870377Z pluggy-1.5.0 | 23 KB | ########## | 100%  2025-05-07T20:31:11.7870757Z 2025-05-07T20:31:11.7870763Z 2025-05-07T20:31:11.7870766Z 2025-05-07T20:31:11.7871127Z 2025-05-07T20:31:11.7883033Z exceptiongroup-1.2.2 | 20 KB | #######9 | 80%  2025-05-07T20:31:11.7883383Z 2025-05-07T20:31:11.7883387Z 2025-05-07T20:31:11.7883391Z 2025-05-07T20:31:11.7883394Z 2025-05-07T20:31:11.7883398Z 2025-05-07T20:31:11.7921512Z tomli-2.2.1 | 19 KB | ########5 | 85%  2025-05-07T20:31:11.7921788Z 2025-05-07T20:31:11.7921792Z 2025-05-07T20:31:11.7921795Z 2025-05-07T20:31:11.7927896Z 2025-05-07T20:31:11.7961822Z exceptiongroup-1.2.2 | 20 KB | ########## | 100%  2025-05-07T20:31:11.7962196Z 2025-05-07T20:31:11.7962202Z 2025-05-07T20:31:11.7962207Z 2025-05-07T20:31:11.7962212Z 2025-05-07T20:31:11.7965261Z 2025-05-07T20:31:11.8719761Z tomli-2.2.1 | 19 KB | ########## | 100%  2025-05-07T20:31:11.8720112Z 2025-05-07T20:31:11.8720116Z 2025-05-07T20:31:11.8720120Z 2025-05-07T20:31:11.8720418Z 2025-05-07T20:31:11.8720425Z 2025-05-07T20:31:11.8730668Z 2025-05-07T20:31:11.8740682Z expecttest-0.3.0 | 14 KB | ########## | 100%  2025-05-07T20:31:11.8741178Z 2025-05-07T20:31:11.8741182Z 2025-05-07T20:31:11.8741186Z 2025-05-07T20:31:11.8741189Z 2025-05-07T20:31:11.8741196Z 2025-05-07T20:31:11.8741200Z 2025-05-07T20:31:11.8742491Z 2025-05-07T20:31:11.8775465Z iniconfig-2.0.0 | 11 KB | ########## | 100%  2025-05-07T20:31:11.8775754Z 2025-05-07T20:31:11.8775758Z 2025-05-07T20:31:11.8775761Z 2025-05-07T20:31:11.8775765Z 2025-05-07T20:31:11.8775769Z 2025-05-07T20:31:11.8775776Z 2025-05-07T20:31:11.8783786Z expecttest-0.3.0 | 14 KB | ########## | 100%  2025-05-07T20:31:11.8784180Z 2025-05-07T20:31:11.8784186Z 2025-05-07T20:31:11.8784191Z 2025-05-07T20:31:11.8784197Z 2025-05-07T20:31:11.8784211Z 2025-05-07T20:31:11.8784217Z 2025-05-07T20:31:11.8784222Z 2025-05-07T20:31:11.8786369Z iniconfig-2.0.0 | 11 KB | ########## | 100%  2025-05-07T20:31:11.8786738Z 2025-05-07T20:31:11.8789049Z 2025-05-07T20:31:11.9061737Z colorama-0.4.6 | 26 KB | ###### | 61%  2025-05-07T20:31:11.9062400Z 2025-05-07T20:31:11.9113372Z 2025-05-07T20:31:11.9621830Z colorama-0.4.6 | 26 KB 
| ########## | 100%  2025-05-07T20:31:11.9622177Z 2025-05-07T20:31:11.9622183Z 2025-05-07T20:31:11.9622723Z 2025-05-07T20:31:11.9631859Z pluggy-1.5.0 | 23 KB | ########## | 100%  2025-05-07T20:31:11.9632203Z 2025-05-07T20:31:11.9632209Z 2025-05-07T20:31:11.9632746Z 2025-05-07T20:31:11.9703497Z pluggy-1.5.0 | 23 KB | ########## | 100%  2025-05-07T20:31:11.9703858Z 2025-05-07T20:31:11.9703865Z 2025-05-07T20:31:11.9703870Z 2025-05-07T20:31:11.9703885Z 2025-05-07T20:31:11.9703891Z 2025-05-07T20:31:11.9718488Z tomli-2.2.1 | 19 KB | ########## | 100%  2025-05-07T20:31:11.9718838Z 2025-05-07T20:31:11.9718856Z 2025-05-07T20:31:11.9718862Z 2025-05-07T20:31:11.9718875Z 2025-05-07T20:31:11.9805173Z exceptiongroup-1.2.2 | 20 KB | ########## | 100%  2025-05-07T20:31:11.9997470Z pytest-8.3.5 | 254 KB | 6 | 6% 2025-05-07T20:31:11.9997812Z 2025-05-07T20:31:11.9997818Z 2025-05-07T20:31:11.9997823Z 2025-05-07T20:31:11.9997828Z 2025-05-07T20:31:11.9997833Z 2025-05-07T20:31:11.9997838Z 2025-05-07T20:31:12.0016406Z expecttest-0.3.0 | 14 KB | ########## | 100%  2025-05-07T20:31:12.0087066Z pytest-8.3.5 | 254 KB | ########## | 100% 2025-05-07T20:31:12.0087413Z 2025-05-07T20:31:12.0087419Z 2025-05-07T20:31:12.0087427Z 2025-05-07T20:31:12.0087605Z 2025-05-07T20:31:12.0087612Z 2025-05-07T20:31:12.0087617Z 2025-05-07T20:31:12.0087626Z 2025-05-07T20:31:12.0161055Z iniconfig-2.0.0 | 11 KB | ########## | 100%  2025-05-07T20:31:12.0161433Z 2025-05-07T20:31:12.0162497Z 2025-05-07T20:31:12.0165188Z colorama-0.4.6 | 26 KB | ########## | 100%  2025-05-07T20:31:12.0165694Z 2025-05-07T20:31:12.0165700Z 2025-05-07T20:31:12.0250016Z colorama-0.4.6 | 26 KB | ########## | 100%  2025-05-07T20:31:12.0250629Z 2025-05-07T20:31:12.0274241Z packaging-25.0 | 61 KB | ##6 | 26%  2025-05-07T20:31:12.0274579Z 2025-05-07T20:31:12.0435866Z packaging-25.0 | 61 KB | ########## | 100%  2025-05-07T20:31:12.0436534Z 2025-05-07T20:31:12.0445094Z packaging-25.0 | 61 KB | ########## | 100%  2025-05-07T20:31:12.0451082Z pytest-8.3.5 | 254 KB | ########## | 100% 2025-05-07T20:31:12.0451553Z 2025-05-07T20:31:12.0451858Z 2025-05-07T20:31:12.0452122Z  2025-05-07T20:31:12.0452392Z 2025-05-07T20:31:12.0452397Z 2025-05-07T20:31:12.0452618Z  2025-05-07T20:31:12.0452907Z 2025-05-07T20:31:12.0453176Z 2025-05-07T20:31:12.0453183Z 2025-05-07T20:31:12.0453426Z  2025-05-07T20:31:12.0453716Z 2025-05-07T20:31:12.0453721Z 2025-05-07T20:31:12.0453892Z 2025-05-07T20:31:12.0453896Z 2025-05-07T20:31:12.0454081Z  2025-05-07T20:31:12.0454292Z 2025-05-07T20:31:12.0454295Z 2025-05-07T20:31:12.0454299Z 2025-05-07T20:31:12.0454302Z 2025-05-07T20:31:12.0454306Z 2025-05-07T20:31:12.0454479Z  2025-05-07T20:31:12.0454690Z 2025-05-07T20:31:12.0454693Z 2025-05-07T20:31:12.0454697Z 2025-05-07T20:31:12.0454700Z 2025-05-07T20:31:12.0454704Z 2025-05-07T20:31:12.0454707Z 2025-05-07T20:31:12.0454882Z  2025-05-07T20:31:12.0455091Z 2025-05-07T20:31:12.0455100Z 2025-05-07T20:31:12.0455111Z 2025-05-07T20:31:12.0455115Z 2025-05-07T20:31:12.0455126Z 2025-05-07T20:31:12.0455130Z 2025-05-07T20:31:12.0455133Z 2025-05-07T20:31:12.0455318Z  done 2025-05-07T20:31:12.1457461Z Preparing transaction: \ done 2025-05-07T20:31:12.2462931Z Verifying transaction: / done 2025-05-07T20:31:14.1490946Z Executing transaction: \ | / - \ | / - \ | / - \ | / - \ | / done 2025-05-07T20:31:14.2763428Z [TEST] Checking imports ... 2025-05-07T20:31:18.2560658Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:31:18.2573366Z [TEST] Setting feature flags ... 
2025-05-07T20:31:18.2573899Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:31:18.2574301Z 2025-05-07T20:31:18.6788845Z 2025-05-07T20:31:18.6789538Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:31:18.6790687Z ################################################################################ 2025-05-07T20:31:18.6791133Z # Run FBGEMM-GPU Tests: 2025-05-07T20:31:18.6791423Z # 2025-05-07T20:31:18.6810447Z # [2025-05-07T20:31:18.680Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:31:18.6810984Z ################################################################################ 2025-05-07T20:31:18.6811231Z 2025-05-07T20:31:18.6820421Z [TEST] Enumerating ALL test files ... 2025-05-07T20:31:18.6849075Z ./attention/gqa_test.py 2025-05-07T20:31:18.6849351Z ./coalesce/coalesce_test.py 2025-05-07T20:31:18.6849621Z ./comm/multi_gpu_car_test.py 2025-05-07T20:31:18.6849896Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:18.6850191Z ./kv_cache/kv_cache_test.py 2025-05-07T20:31:18.6850437Z ./moe/activation_test.py 2025-05-07T20:31:18.6850691Z ./moe/gather_scatter_test.py 2025-05-07T20:31:18.6850941Z ./moe/layers_test.py 2025-05-07T20:31:18.6851164Z ./moe/shuffling_test.py 2025-05-07T20:31:18.6851406Z ./quantize/quantize_test.py 2025-05-07T20:31:18.6851568Z 2025-05-07T20:31:18.6851700Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:31:18.6851908Z 2025-05-07T20:31:18.6869766Z ################################################################################ 2025-05-07T20:31:18.6885050Z # [2025-05-07T20:31:18.688Z] Run Python Test Suite: 2025-05-07T20:31:18.6885448Z # ./attention/gqa_test.py 2025-05-07T20:31:18.6885821Z ################################################################################ 2025-05-07T20:31:18.6909011Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:31:18.6909619Z 2025-05-07T20:31:21.2393048Z ============================= test session starts ============================== 2025-05-07T20:31:21.2394603Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:21.2395218Z cachedir: .pytest_cache 2025-05-07T20:31:21.2396194Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:21.2396930Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:21.2397481Z plugins: hypothesis-6.131.14 2025-05-07T20:31:22.9177081Z collecting ... 
collected 2 items 2025-05-07T20:31:22.9177364Z 2025-05-07T20:31:58.4717320Z attention/gqa_test.py::Int4GQATest::test_gqa Trying example: test_gqa( 2025-05-07T20:31:58.4720005Z self=, 2025-05-07T20:31:58.4720685Z int4_kv=False, 2025-05-07T20:31:58.4720986Z num_groups=1, 2025-05-07T20:31:58.4721238Z B=1, 2025-05-07T20:31:58.4721469Z MAX_T=4, 2025-05-07T20:31:58.4721708Z N_H_L=1, 2025-05-07T20:31:58.4726011Z ) 2025-05-07T20:31:58.4726323Z Trying example: test_gqa( 2025-05-07T20:31:58.4726716Z self=, 2025-05-07T20:31:58.4727104Z int4_kv=True, 2025-05-07T20:31:58.4727366Z num_groups=1, 2025-05-07T20:31:58.4727678Z B=1, 2025-05-07T20:31:58.4727905Z MAX_T=4, 2025-05-07T20:31:58.4728144Z N_H_L=1, 2025-05-07T20:31:58.4728380Z ) 2025-05-07T20:31:58.4728617Z Trying example: test_gqa( 2025-05-07T20:31:58.4728995Z self=, 2025-05-07T20:31:58.4729382Z int4_kv=True, 2025-05-07T20:31:58.4729639Z num_groups=4, 2025-05-07T20:31:58.4729887Z B=23, 2025-05-07T20:31:58.4730117Z MAX_T=33, 2025-05-07T20:31:58.4730358Z N_H_L=68, 2025-05-07T20:31:58.4730587Z ) 2025-05-07T20:31:58.4730834Z Trying example: test_gqa( 2025-05-07T20:31:58.4731188Z self=, 2025-05-07T20:31:58.4731561Z int4_kv=True, 2025-05-07T20:31:58.4731815Z num_groups=4, 2025-05-07T20:31:58.4732071Z B=77, 2025-05-07T20:31:58.4732292Z MAX_T=4, 2025-05-07T20:31:58.4732545Z N_H_L=1, 2025-05-07T20:31:58.4732779Z ) 2025-05-07T20:31:58.4733009Z Trying example: test_gqa( 2025-05-07T20:31:58.4733369Z self=, 2025-05-07T20:31:58.4733749Z int4_kv=True, 2025-05-07T20:31:58.4734002Z num_groups=4, 2025-05-07T20:31:58.4734253Z B=77, 2025-05-07T20:31:58.4734490Z MAX_T=52, 2025-05-07T20:31:58.4734730Z N_H_L=67, 2025-05-07T20:31:58.4734963Z ) 2025-05-07T20:31:58.4735198Z Trying example: test_gqa( 2025-05-07T20:31:58.4735544Z self=, 2025-05-07T20:31:58.4735932Z int4_kv=False, 2025-05-07T20:31:58.4736191Z num_groups=4, 2025-05-07T20:31:58.4736440Z B=57, 2025-05-07T20:31:58.4736670Z MAX_T=45, 2025-05-07T20:31:58.4736912Z N_H_L=120, 2025-05-07T20:31:58.4737145Z ) 2025-05-07T20:31:58.4737384Z Trying example: test_gqa( 2025-05-07T20:31:58.4737746Z self=, 2025-05-07T20:31:58.4738129Z int4_kv=True, 2025-05-07T20:31:58.4738384Z num_groups=4, 2025-05-07T20:31:58.4738629Z B=52, 2025-05-07T20:31:58.4738859Z MAX_T=42, 2025-05-07T20:31:58.4739099Z N_H_L=53, 2025-05-07T20:31:58.4739335Z ) 2025-05-07T20:31:58.4739572Z Trying example: test_gqa( 2025-05-07T20:31:58.4739921Z self=, 2025-05-07T20:31:58.4740296Z int4_kv=True, 2025-05-07T20:31:58.4740556Z num_groups=1, 2025-05-07T20:31:58.4740805Z B=77, 2025-05-07T20:31:58.4741028Z MAX_T=95, 2025-05-07T20:31:58.4741265Z N_H_L=53, 2025-05-07T20:31:58.4741564Z ) 2025-05-07T20:31:58.4742016Z Trying example: test_gqa( 2025-05-07T20:31:58.4742478Z self=, 2025-05-07T20:31:58.4742938Z int4_kv=True, 2025-05-07T20:31:58.4743325Z num_groups=4, 2025-05-07T20:31:58.4743679Z B=113, 2025-05-07T20:31:58.4743990Z MAX_T=48, 2025-05-07T20:31:58.4754518Z N_H_L=96, 2025-05-07T20:31:58.4754902Z ) 2025-05-07T20:31:58.4755178Z Trying example: test_gqa( 2025-05-07T20:31:58.4755570Z self=, 2025-05-07T20:31:58.4756068Z int4_kv=False, 2025-05-07T20:31:58.4756344Z num_groups=1, 2025-05-07T20:31:58.4756974Z B=51, 2025-05-07T20:31:58.4757222Z MAX_T=61, 2025-05-07T20:31:58.4757470Z N_H_L=69, 2025-05-07T20:31:58.4757708Z ) 2025-05-07T20:31:58.4757954Z Trying example: test_gqa( 2025-05-07T20:31:58.4758555Z self=, 2025-05-07T20:31:58.4758938Z int4_kv=False, 2025-05-07T20:31:58.4759202Z num_groups=4, 2025-05-07T20:31:58.4759461Z B=17, 2025-05-07T20:31:58.4759695Z MAX_T=113, 
2025-05-07T20:31:58.4759940Z N_H_L=65, 2025-05-07T20:31:58.4760181Z ) 2025-05-07T20:31:58.4760423Z Trying example: test_gqa( 2025-05-07T20:31:58.4760771Z self=, 2025-05-07T20:31:58.4761159Z int4_kv=False, 2025-05-07T20:31:58.4761420Z num_groups=4, 2025-05-07T20:31:58.4761667Z B=17, 2025-05-07T20:31:58.4761899Z MAX_T=65, 2025-05-07T20:31:58.4762141Z N_H_L=65, 2025-05-07T20:31:58.4762376Z ) 2025-05-07T20:31:58.4762616Z Trying example: test_gqa( 2025-05-07T20:31:58.4762981Z self=, 2025-05-07T20:31:58.4763359Z int4_kv=False, 2025-05-07T20:31:58.4763620Z num_groups=4, 2025-05-07T20:31:58.4763878Z B=65, 2025-05-07T20:31:58.4764113Z MAX_T=65, 2025-05-07T20:31:58.4764357Z N_H_L=65, 2025-05-07T20:31:58.4764595Z ) 2025-05-07T20:31:58.4764852Z Trying example: test_gqa( 2025-05-07T20:31:58.4765229Z self=, 2025-05-07T20:31:58.4765927Z int4_kv=False, 2025-05-07T20:31:58.4766183Z num_groups=1, 2025-05-07T20:31:58.4766468Z B=6, 2025-05-07T20:31:58.4766722Z MAX_T=108, 2025-05-07T20:31:58.4766978Z N_H_L=14, 2025-05-07T20:31:58.4767232Z ) 2025-05-07T20:31:58.4767484Z Trying example: test_gqa( 2025-05-07T20:31:58.4767814Z self=, 2025-05-07T20:31:58.4768137Z int4_kv=False, 2025-05-07T20:31:58.4768355Z num_groups=1, 2025-05-07T20:31:58.4768563Z B=6, 2025-05-07T20:31:58.4768751Z MAX_T=14, 2025-05-07T20:31:58.4768963Z N_H_L=14, 2025-05-07T20:31:58.4769162Z ) 2025-05-07T20:31:58.4769351Z Trying example: test_gqa( 2025-05-07T20:31:58.4769648Z self=, 2025-05-07T20:31:58.4769969Z int4_kv=False, 2025-05-07T20:31:58.4770180Z num_groups=1, 2025-05-07T20:31:58.4770394Z B=6, 2025-05-07T20:31:58.4770588Z MAX_T=6, 2025-05-07T20:31:58.4770782Z N_H_L=14, 2025-05-07T20:31:58.4770979Z ) 2025-05-07T20:31:58.4771175Z Trying example: test_gqa( 2025-05-07T20:31:58.4771465Z self=, 2025-05-07T20:31:58.4771781Z int4_kv=False, 2025-05-07T20:31:58.4771997Z num_groups=1, 2025-05-07T20:31:58.4772201Z B=6, 2025-05-07T20:31:58.4772392Z MAX_T=6, 2025-05-07T20:31:58.4772589Z N_H_L=6, 2025-05-07T20:31:58.4772777Z ) 2025-05-07T20:31:58.4772979Z Trying example: test_gqa( 2025-05-07T20:31:58.4773270Z self=, 2025-05-07T20:31:58.4773579Z int4_kv=False, 2025-05-07T20:31:58.4773796Z num_groups=1, 2025-05-07T20:31:58.4774006Z B=70, 2025-05-07T20:31:58.4774190Z MAX_T=94, 2025-05-07T20:31:58.4774389Z N_H_L=78, 2025-05-07T20:31:58.4774584Z ) 2025-05-07T20:31:58.4774779Z Trying example: test_gqa( 2025-05-07T20:31:58.4775069Z self=, 2025-05-07T20:31:58.4775386Z int4_kv=False, 2025-05-07T20:31:58.4775592Z num_groups=1, 2025-05-07T20:31:58.4775803Z B=78, 2025-05-07T20:31:58.4775991Z MAX_T=94, 2025-05-07T20:31:58.4776190Z N_H_L=78, 2025-05-07T20:31:58.4776381Z ) 2025-05-07T20:31:58.4776578Z Trying example: test_gqa( 2025-05-07T20:31:58.4776866Z self=, 2025-05-07T20:31:58.4777179Z int4_kv=False, 2025-05-07T20:31:58.4777393Z num_groups=1, 2025-05-07T20:31:58.4777600Z B=94, 2025-05-07T20:31:58.4777782Z MAX_T=94, 2025-05-07T20:31:58.4777980Z N_H_L=78, 2025-05-07T20:31:58.4778173Z ) 2025-05-07T20:31:58.4778358Z Trying example: test_gqa( 2025-05-07T20:31:58.4778800Z self=, 2025-05-07T20:31:58.4779119Z int4_kv=False, 2025-05-07T20:31:58.4779325Z num_groups=1, 2025-05-07T20:31:58.4779532Z B=94, 2025-05-07T20:31:58.4779840Z MAX_T=94, 2025-05-07T20:31:58.4780032Z N_H_L=94, 2025-05-07T20:31:58.4780227Z ) 2025-05-07T20:31:58.4780423Z Trying example: test_gqa( 2025-05-07T20:31:58.4780710Z self=, 2025-05-07T20:31:58.4781028Z int4_kv=False, 2025-05-07T20:31:58.4781243Z num_groups=4, 2025-05-07T20:31:58.4781445Z B=41, 2025-05-07T20:31:58.4781637Z MAX_T=105, 
2025-05-07T20:31:58.4781842Z N_H_L=126, 2025-05-07T20:31:58.4782035Z ) 2025-05-07T20:31:58.4782231Z Trying example: test_gqa( 2025-05-07T20:31:58.4782523Z self=, 2025-05-07T20:31:58.4782830Z int4_kv=False, 2025-05-07T20:31:58.4783041Z num_groups=4, 2025-05-07T20:31:58.4783255Z B=105, 2025-05-07T20:31:58.4783454Z MAX_T=105, 2025-05-07T20:31:58.4783655Z N_H_L=126, 2025-05-07T20:31:58.4783857Z ) 2025-05-07T20:31:58.4784055Z Trying example: test_gqa( 2025-05-07T20:31:58.4784338Z self=, 2025-05-07T20:31:58.4784661Z int4_kv=False, 2025-05-07T20:31:58.4784865Z num_groups=4, 2025-05-07T20:31:58.4785070Z B=105, 2025-05-07T20:31:58.4785262Z MAX_T=105, 2025-05-07T20:31:58.4785463Z N_H_L=105, 2025-05-07T20:31:58.4785652Z ) 2025-05-07T20:31:58.4785844Z Trying example: test_gqa( 2025-05-07T20:31:58.4786134Z self=, 2025-05-07T20:31:58.4786438Z int4_kv=True, 2025-05-07T20:31:58.4786643Z num_groups=1, 2025-05-07T20:31:58.4786848Z B=95, 2025-05-07T20:31:58.4787033Z MAX_T=114, 2025-05-07T20:31:58.4787231Z N_H_L=43, 2025-05-07T20:31:58.4787424Z ) 2025-05-07T20:31:58.4787611Z Trying example: test_gqa( 2025-05-07T20:31:58.4787899Z self=, 2025-05-07T20:31:58.4788215Z int4_kv=True, 2025-05-07T20:31:58.4788424Z num_groups=1, 2025-05-07T20:31:58.4788625Z B=43, 2025-05-07T20:31:58.4788813Z MAX_T=114, 2025-05-07T20:31:58.4789013Z N_H_L=43, 2025-05-07T20:31:58.4789204Z ) 2025-05-07T20:31:58.4789398Z Trying example: test_gqa( 2025-05-07T20:31:58.4789687Z self=, 2025-05-07T20:31:58.4789990Z int4_kv=True, 2025-05-07T20:31:58.4790200Z num_groups=1, 2025-05-07T20:31:58.4790407Z B=43, 2025-05-07T20:31:58.4790589Z MAX_T=43, 2025-05-07T20:31:58.4790787Z N_H_L=43, 2025-05-07T20:31:58.4790978Z ) 2025-05-07T20:31:58.4791164Z Trying example: test_gqa( 2025-05-07T20:31:58.4791450Z self=, 2025-05-07T20:31:58.4791760Z int4_kv=False, 2025-05-07T20:31:58.4791964Z num_groups=1, 2025-05-07T20:31:58.4792169Z B=21, 2025-05-07T20:31:58.4792356Z MAX_T=38, 2025-05-07T20:31:58.4792601Z N_H_L=42, 2025-05-07T20:31:58.4792792Z ) 2025-05-07T20:31:58.4792988Z Trying example: test_gqa( 2025-05-07T20:31:58.4793270Z self=, 2025-05-07T20:31:58.4793582Z int4_kv=False, 2025-05-07T20:31:58.4793794Z num_groups=1, 2025-05-07T20:31:58.4793996Z B=38, 2025-05-07T20:31:58.4794184Z MAX_T=38, 2025-05-07T20:31:58.4794387Z N_H_L=42, 2025-05-07T20:31:58.4794571Z ) 2025-05-07T20:31:58.4794766Z Trying example: test_gqa( 2025-05-07T20:31:58.4795059Z self=, 2025-05-07T20:31:58.4795401Z int4_kv=False, 2025-05-07T20:31:58.4795627Z num_groups=1, 2025-05-07T20:31:58.4795937Z B=38, 2025-05-07T20:31:58.4796125Z MAX_T=42, 2025-05-07T20:31:58.4796314Z N_H_L=42, 2025-05-07T20:31:58.4796505Z ) 2025-05-07T20:31:58.4796710Z Trying example: test_gqa( 2025-05-07T20:31:58.4796994Z self=, 2025-05-07T20:31:58.4797306Z int4_kv=False, 2025-05-07T20:31:58.4797520Z num_groups=1, 2025-05-07T20:31:58.4797719Z B=42, 2025-05-07T20:31:58.4798015Z MAX_T=42, 2025-05-07T20:31:58.4798215Z N_H_L=42, 2025-05-07T20:31:58.4798406Z ) 2025-05-07T20:31:58.4798603Z Trying example: test_gqa( 2025-05-07T20:31:58.4798899Z self=, 2025-05-07T20:31:58.4799283Z int4_kv=True, 2025-05-07T20:31:58.4799501Z num_groups=1, 2025-05-07T20:31:58.4799716Z B=74, 2025-05-07T20:31:58.4799903Z MAX_T=20, 2025-05-07T20:31:58.4800109Z N_H_L=15, 2025-05-07T20:31:58.4800310Z ) 2025-05-07T20:31:58.4800501Z Trying example: test_gqa( 2025-05-07T20:31:58.4800795Z self=, 2025-05-07T20:31:58.4801117Z int4_kv=True, 2025-05-07T20:31:58.4801321Z num_groups=1, 2025-05-07T20:31:58.4801534Z B=20, 2025-05-07T20:31:58.4801727Z MAX_T=20, 
2025-05-07T20:31:58.4801921Z N_H_L=15, 2025-05-07T20:31:58.4802114Z ) 2025-05-07T20:31:58.4802309Z Trying example: test_gqa( 2025-05-07T20:31:58.4802598Z self=, 2025-05-07T20:31:58.4802917Z int4_kv=True, 2025-05-07T20:31:58.4803128Z num_groups=1, 2025-05-07T20:31:58.4803341Z B=20, 2025-05-07T20:31:58.4803525Z MAX_T=15, 2025-05-07T20:31:58.4803717Z N_H_L=15, 2025-05-07T20:31:58.4803919Z ) 2025-05-07T20:31:58.4804109Z Trying example: test_gqa( 2025-05-07T20:31:58.4804397Z self=, 2025-05-07T20:31:58.4804708Z int4_kv=True, 2025-05-07T20:31:58.4804913Z num_groups=1, 2025-05-07T20:31:58.4805119Z B=15, 2025-05-07T20:31:58.4805307Z MAX_T=20, 2025-05-07T20:31:58.4805500Z N_H_L=15, 2025-05-07T20:31:58.4805697Z ) 2025-05-07T20:31:58.4805894Z Trying example: test_gqa( 2025-05-07T20:31:58.4806179Z self=, 2025-05-07T20:31:58.4806493Z int4_kv=True, 2025-05-07T20:31:58.4806703Z num_groups=1, 2025-05-07T20:31:58.4806900Z B=15, 2025-05-07T20:31:58.4807091Z MAX_T=15, 2025-05-07T20:31:58.4807286Z N_H_L=15, 2025-05-07T20:31:58.4807475Z ) 2025-05-07T20:31:58.4807675Z Trying example: test_gqa( 2025-05-07T20:31:58.4807974Z self=, 2025-05-07T20:31:58.4808281Z int4_kv=False, 2025-05-07T20:31:58.4808507Z num_groups=4, 2025-05-07T20:31:58.4808723Z B=117, 2025-05-07T20:31:58.4808909Z MAX_T=104, 2025-05-07T20:31:58.4809113Z N_H_L=69, 2025-05-07T20:31:58.4809316Z ) 2025-05-07T20:31:58.4809507Z Trying example: test_gqa( 2025-05-07T20:31:58.4809801Z self=, 2025-05-07T20:31:58.4810120Z int4_kv=False, 2025-05-07T20:31:58.4810339Z num_groups=4, 2025-05-07T20:31:58.4810546Z B=117, 2025-05-07T20:31:58.4810750Z MAX_T=117, 2025-05-07T20:31:58.4810951Z N_H_L=69, 2025-05-07T20:31:58.4811145Z ) 2025-05-07T20:31:58.4811347Z Trying example: test_gqa( 2025-05-07T20:31:58.4811636Z self=, 2025-05-07T20:31:58.4811946Z int4_kv=False, 2025-05-07T20:31:58.4812163Z num_groups=4, 2025-05-07T20:31:58.4812377Z B=69, 2025-05-07T20:31:58.4812565Z MAX_T=117, 2025-05-07T20:31:58.4812774Z N_H_L=69, 2025-05-07T20:31:58.4812969Z ) 2025-05-07T20:31:58.4813164Z Trying example: test_gqa( 2025-05-07T20:31:58.4813458Z self=, 2025-05-07T20:31:58.4813768Z int4_kv=False, 2025-05-07T20:31:58.4813979Z num_groups=4, 2025-05-07T20:31:58.4814192Z B=117, 2025-05-07T20:31:58.4814387Z MAX_T=69, 2025-05-07T20:31:58.4814588Z N_H_L=69, 2025-05-07T20:31:58.4814795Z ) 2025-05-07T20:31:58.4814988Z PASSED 2025-05-07T20:31:58.4917442Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...) 
2025-05-07T20:31:58.4917782Z 2025-05-07T20:31:58.4917937Z =========================== short test summary info ============================ 2025-05-07T20:31:58.4918665Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when CUDA is not available or xformers is not available 2025-05-07T20:31:58.4919554Z ======================== 1 passed, 1 skipped in 37.76s ========================= 2025-05-07T20:31:59.1534537Z 2025-05-07T20:31:59.1535118Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:31:59.1556195Z [TEST] Python test time for ./attention/gqa_test.py: 41 seconds 2025-05-07T20:31:59.1556490Z 2025-05-07T20:31:59.1556495Z 2025-05-07T20:31:59.1556500Z 2025-05-07T20:31:59.1556503Z 2025-05-07T20:31:59.1578711Z ################################################################################ 2025-05-07T20:31:59.1594411Z # [2025-05-07T20:31:59.159Z] Run Python Test Suite: 2025-05-07T20:31:59.1594763Z # ./coalesce/coalesce_test.py 2025-05-07T20:31:59.1595060Z ################################################################################ 2025-05-07T20:31:59.1619578Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:31:59.1620202Z 2025-05-07T20:32:01.3213136Z ============================= test session starts ============================== 2025-05-07T20:32:01.3213779Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:01.3214307Z cachedir: .pytest_cache 2025-05-07T20:32:01.3214887Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:01.3215615Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:01.3216030Z plugins: hypothesis-6.131.14 2025-05-07T20:32:03.0587836Z collecting ... 
collected 1 item 2025-05-07T20:32:03.8167520Z 2025-05-07T20:32:03.8167813Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:32:03.8168137Z 2025-05-07T20:32:03.8168415Z ============================== 1 passed in 2.62s =============================== 2025-05-07T20:32:04.4535417Z 2025-05-07T20:32:04.4536148Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:32:04.4553687Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:32:04.4554081Z 2025-05-07T20:32:04.4554087Z 2025-05-07T20:32:04.4554109Z 2025-05-07T20:32:04.4554114Z 2025-05-07T20:32:04.4576521Z ################################################################################ 2025-05-07T20:32:04.4593551Z # [2025-05-07T20:32:04.459Z] Run Python Test Suite: 2025-05-07T20:32:04.4593903Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:32:04.4594198Z ################################################################################ 2025-05-07T20:32:04.4619630Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:32:04.4620247Z 2025-05-07T20:32:06.6328558Z ============================= test session starts ============================== 2025-05-07T20:32:06.6329236Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:06.6329761Z cachedir: .pytest_cache 2025-05-07T20:32:06.6330338Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:06.6331088Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:06.6331490Z plugins: hypothesis-6.131.14 2025-05-07T20:32:08.3337235Z collecting ... 
collected 5 items 2025-05-07T20:32:08.3337656Z 2025-05-07T20:32:08.3350508Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:32:08.3360237Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:32:08.3369112Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:32:08.3377432Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:32:08.3397452Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:32:08.3397791Z 2025-05-07T20:32:08.3397945Z =========================== short test summary info ============================ 2025-05-07T20:32:08.3398617Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.3399710Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.3400633Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.3401555Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.3402479Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.3403121Z ============================== 5 skipped in 1.83s ============================== 2025-05-07T20:32:08.9158472Z 2025-05-07T20:32:08.9159116Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:32:08.9179049Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 4 seconds 2025-05-07T20:32:08.9179342Z 2025-05-07T20:32:08.9179347Z 2025-05-07T20:32:08.9179350Z 2025-05-07T20:32:08.9179354Z 2025-05-07T20:32:08.9199582Z ################################################################################ 2025-05-07T20:32:08.9217263Z # [2025-05-07T20:32:08.921Z] Run Python Test Suite: 2025-05-07T20:32:08.9217622Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:32:08.9217945Z ################################################################################ 2025-05-07T20:32:08.9242926Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:32:08.9243589Z 2025-05-07T20:32:11.0750946Z ============================= test session starts ============================== 2025-05-07T20:32:11.0751831Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:11.0752353Z cachedir: .pytest_cache 2025-05-07T20:32:11.0752922Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:11.0753649Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:11.0754060Z plugins: hypothesis-6.131.14 2025-05-07T20:32:12.8720152Z collecting ... 
collected 2 items 2025-05-07T20:32:12.8720365Z 2025-05-07T20:32:12.8731820Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:32:12.8748989Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:32:12.8749421Z 2025-05-07T20:32:12.8749580Z =========================== short test summary info ============================ 2025-05-07T20:32:12.8750201Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:32:12.8751042Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:32:12.8751642Z ============================== 2 skipped in 1.92s ============================== 2025-05-07T20:32:13.4652067Z 2025-05-07T20:32:13.4652554Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:32:13.4672625Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 5 seconds 2025-05-07T20:32:13.4672951Z 2025-05-07T20:32:13.4672955Z 2025-05-07T20:32:13.4672959Z 2025-05-07T20:32:13.4673029Z 2025-05-07T20:32:13.4695183Z ################################################################################ 2025-05-07T20:32:13.4710397Z # [2025-05-07T20:32:13.470Z] Run Python Test Suite: 2025-05-07T20:32:13.4710735Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:32:13.4711032Z ################################################################################ 2025-05-07T20:32:13.4735502Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:32:13.4736119Z 2025-05-07T20:32:15.6273008Z ============================= test session starts ============================== 2025-05-07T20:32:15.6273653Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:15.6274177Z cachedir: .pytest_cache 2025-05-07T20:32:15.6274746Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:15.6275498Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:15.6275967Z plugins: hypothesis-6.131.14 2025-05-07T20:32:17.3193626Z collecting ... collected 4 items 2025-05-07T20:32:17.3194033Z 2025-05-07T20:32:20.0693087Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
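All of the SKIPPED results in this run are hardware gates, not failures: the comm tests require at least two GPUs, the gather_scatter tests require a Hopper-class GPU, and the fp8 kv_cache test requires an H100 or MI300, as the skip reasons in each short test summary state. A minimal sketch of that skip pattern follows; running_on_hopper() is a hypothetical helper, while the decorator messages are copied from the log.

    # Sketch only: running_on_hopper() is a hypothetical helper, not FBGEMM code.
    import unittest

    import torch


    def running_on_hopper() -> bool:
        # Hopper (H100) reports CUDA compute capability (9, 0).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() == (9, 0)


    class GuardedTests(unittest.TestCase):
        @unittest.skipIf(
            not torch.cuda.is_available() or torch.cuda.device_count() < 2,
            "Skip when CUDA is not available or when there are not enough GPUs; "
            "these tests require at least two GPUs",
        )
        def test_needs_two_gpus(self) -> None: ...

        @unittest.skipIf(
            not running_on_hopper(),
            "Skip when no Hopper GPU is available. This test is only for Hopper GPU.",
        )
        def test_needs_hopper(self) -> None: ...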
2025-05-07T20:32:20.0776540Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:32:20.0872586Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:32:20.0962470Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:32:20.0962838Z 2025-05-07T20:32:20.0962988Z =========================== short test summary info ============================ 2025-05-07T20:32:20.0963696Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when H100 is not available or MI300 is not available 2025-05-07T20:32:20.0964634Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when xformers is not available 2025-05-07T20:32:20.0965249Z ============================== 4 skipped in 4.59s ============================== 2025-05-07T20:32:22.0299219Z 2025-05-07T20:32:22.0299711Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:32:22.0319393Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 9 seconds 2025-05-07T20:32:22.0319689Z 2025-05-07T20:32:22.0319926Z 2025-05-07T20:32:22.0319929Z 2025-05-07T20:32:22.0319989Z 2025-05-07T20:32:22.0341579Z ################################################################################ 2025-05-07T20:32:22.0356695Z # [2025-05-07T20:32:22.035Z] Run Python Test Suite: 2025-05-07T20:32:22.0357067Z # ./moe/activation_test.py 2025-05-07T20:32:22.0357380Z ################################################################################ 2025-05-07T20:32:22.0381974Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:32:22.0382806Z 2025-05-07T20:32:24.1911920Z ============================= test session starts ============================== 2025-05-07T20:32:24.1912582Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:24.1922032Z cachedir: .pytest_cache 2025-05-07T20:32:24.1922679Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:24.1923417Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:24.1923835Z plugins: hypothesis-6.131.14 2025-05-07T20:32:25.8318015Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:25.9388843Z collecting ... 
collected 2 items 2025-05-07T20:32:25.9389043Z 2025-05-07T20:32:31.2858900Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:32:31.2860121Z self=, 2025-05-07T20:32:31.2861236Z T=1, 2025-05-07T20:32:31.2861634Z D=5120, 2025-05-07T20:32:31.2862414Z contiguous=True, 2025-05-07T20:32:31.2862763Z compiled=True, 2025-05-07T20:32:31.2863053Z ) 2025-05-07T20:32:31.2863334Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2863843Z self=, 2025-05-07T20:32:31.2864226Z T=4096, 2025-05-07T20:32:31.2864423Z D=5120, 2025-05-07T20:32:31.2864626Z contiguous=True, 2025-05-07T20:32:31.2864850Z compiled=True, 2025-05-07T20:32:31.2865057Z ) 2025-05-07T20:32:31.2865262Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2865929Z self=, 2025-05-07T20:32:31.2866321Z T=4096, 2025-05-07T20:32:31.2866519Z D=7168, 2025-05-07T20:32:31.2866718Z contiguous=False, 2025-05-07T20:32:31.2866966Z compiled=False, 2025-05-07T20:32:31.2867179Z ) 2025-05-07T20:32:31.2867375Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2867755Z self=, 2025-05-07T20:32:31.2868152Z T=4096, 2025-05-07T20:32:31.2868338Z D=5120, 2025-05-07T20:32:31.2868544Z contiguous=False, 2025-05-07T20:32:31.2868779Z compiled=True, 2025-05-07T20:32:31.2868989Z ) 2025-05-07T20:32:31.2869190Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2869562Z self=, 2025-05-07T20:32:31.2869944Z T=1, 2025-05-07T20:32:31.2870131Z D=7168, 2025-05-07T20:32:31.2870337Z contiguous=True, 2025-05-07T20:32:31.2870563Z compiled=True, 2025-05-07T20:32:31.2870773Z ) 2025-05-07T20:32:31.2870980Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2871356Z self=, 2025-05-07T20:32:31.2871734Z T=1, 2025-05-07T20:32:31.2871932Z D=7168, 2025-05-07T20:32:31.2872142Z contiguous=False, 2025-05-07T20:32:31.2872372Z compiled=True, 2025-05-07T20:32:31.2872586Z ) 2025-05-07T20:32:31.2872793Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2873173Z self=, 2025-05-07T20:32:31.2873566Z T=4096, 2025-05-07T20:32:31.2873769Z D=5120, 2025-05-07T20:32:31.2873971Z contiguous=False, 2025-05-07T20:32:31.2874214Z compiled=False, 2025-05-07T20:32:31.2874431Z ) 2025-05-07T20:32:31.2874633Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2875015Z self=, 2025-05-07T20:32:31.2875403Z T=1, 2025-05-07T20:32:31.2875597Z D=7168, 2025-05-07T20:32:31.2875891Z contiguous=True, 2025-05-07T20:32:31.2876127Z compiled=False, 2025-05-07T20:32:31.2876341Z ) 2025-05-07T20:32:31.2876539Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2876924Z self=, 2025-05-07T20:32:31.2877310Z T=2048, 2025-05-07T20:32:31.2877498Z D=5120, 2025-05-07T20:32:31.2877697Z contiguous=True, 2025-05-07T20:32:31.2877924Z compiled=True, 2025-05-07T20:32:31.2878132Z ) 2025-05-07T20:32:31.2878339Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2878714Z self=, 2025-05-07T20:32:31.2879089Z T=2048, 2025-05-07T20:32:31.2879283Z D=7168, 2025-05-07T20:32:31.2879486Z contiguous=True, 2025-05-07T20:32:31.2879707Z compiled=True, 2025-05-07T20:32:31.2879917Z ) 2025-05-07T20:32:31.2880119Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2880481Z self=, 2025-05-07T20:32:31.2880873Z T=2048, 2025-05-07T20:32:31.2881067Z D=7168, 2025-05-07T20:32:31.2881260Z contiguous=True, 2025-05-07T20:32:31.2881491Z compiled=False, 2025-05-07T20:32:31.2881702Z ) 2025-05-07T20:32:31.2882067Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2882439Z self=, 2025-05-07T20:32:31.2882823Z T=128, 2025-05-07T20:32:31.2883021Z D=5120, 2025-05-07T20:32:31.2883338Z contiguous=False, 2025-05-07T20:32:31.2883571Z 
compiled=True, 2025-05-07T20:32:31.2883782Z ) 2025-05-07T20:32:31.2883980Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2884358Z self=, 2025-05-07T20:32:31.2884750Z T=128, 2025-05-07T20:32:31.2884941Z D=5120, 2025-05-07T20:32:31.2885147Z contiguous=True, 2025-05-07T20:32:31.2885378Z compiled=True, 2025-05-07T20:32:31.2885583Z ) 2025-05-07T20:32:31.2885787Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2886166Z self=, 2025-05-07T20:32:31.2886546Z T=16384, 2025-05-07T20:32:31.2886749Z D=5120, 2025-05-07T20:32:31.2886988Z contiguous=False, 2025-05-07T20:32:31.2887232Z compiled=True, 2025-05-07T20:32:31.2887440Z ) 2025-05-07T20:32:31.2887641Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2888016Z self=, 2025-05-07T20:32:31.2888408Z T=16384, 2025-05-07T20:32:31.2888601Z D=5120, 2025-05-07T20:32:31.2888804Z contiguous=False, 2025-05-07T20:32:31.2889041Z compiled=False, 2025-05-07T20:32:31.2889255Z ) 2025-05-07T20:32:31.2889458Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2889825Z self=, 2025-05-07T20:32:31.2890202Z T=128, 2025-05-07T20:32:31.2890392Z D=7168, 2025-05-07T20:32:31.2890589Z contiguous=True, 2025-05-07T20:32:31.2890814Z compiled=False, 2025-05-07T20:32:31.2891024Z ) 2025-05-07T20:32:31.2891225Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2891604Z self=, 2025-05-07T20:32:31.2891981Z T=128, 2025-05-07T20:32:31.2892180Z D=7168, 2025-05-07T20:32:31.2892397Z contiguous=False, 2025-05-07T20:32:31.2892645Z compiled=False, 2025-05-07T20:32:31.2892854Z ) 2025-05-07T20:32:31.2893045Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2893419Z self=, 2025-05-07T20:32:31.2893799Z T=1, 2025-05-07T20:32:31.2893981Z D=5120, 2025-05-07T20:32:31.2894179Z contiguous=False, 2025-05-07T20:32:31.2894406Z compiled=False, 2025-05-07T20:32:31.2894607Z ) 2025-05-07T20:32:31.2894807Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2895176Z self=, 2025-05-07T20:32:31.2895546Z T=1, 2025-05-07T20:32:31.2895732Z D=7168, 2025-05-07T20:32:31.2895930Z contiguous=False, 2025-05-07T20:32:31.2896149Z compiled=False, 2025-05-07T20:32:31.2896356Z ) 2025-05-07T20:32:31.2896556Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2896923Z self=, 2025-05-07T20:32:31.2897304Z T=4096, 2025-05-07T20:32:31.2897494Z D=5120, 2025-05-07T20:32:31.2897694Z contiguous=True, 2025-05-07T20:32:31.2897920Z compiled=False, 2025-05-07T20:32:31.2898127Z ) 2025-05-07T20:32:31.2898330Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2898700Z self=, 2025-05-07T20:32:31.2899079Z T=128, 2025-05-07T20:32:31.2899272Z D=7168, 2025-05-07T20:32:31.2899464Z contiguous=True, 2025-05-07T20:32:31.2899689Z compiled=True, 2025-05-07T20:32:31.2899897Z ) 2025-05-07T20:32:31.2900094Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2900465Z self=, 2025-05-07T20:32:31.2900847Z T=1, 2025-05-07T20:32:31.2901027Z D=5120, 2025-05-07T20:32:31.2901238Z contiguous=False, 2025-05-07T20:32:31.2901473Z compiled=True, 2025-05-07T20:32:31.2901674Z ) 2025-05-07T20:32:31.2901975Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2902352Z self=, 2025-05-07T20:32:31.2902724Z T=4096, 2025-05-07T20:32:31.2903015Z D=7168, 2025-05-07T20:32:31.2903214Z contiguous=True, 2025-05-07T20:32:31.2903431Z compiled=False, 2025-05-07T20:32:31.2903641Z ) 2025-05-07T20:32:31.2903838Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2904208Z self=, 2025-05-07T20:32:31.2904586Z T=4096, 2025-05-07T20:32:31.2904775Z D=7168, 2025-05-07T20:32:31.2904971Z contiguous=False, 2025-05-07T20:32:31.2905188Z compiled=True, 2025-05-07T20:32:31.2905392Z ) 
2025-05-07T20:32:31.2905590Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2905952Z self=, 2025-05-07T20:32:31.2906327Z T=128, 2025-05-07T20:32:31.2906515Z D=5120, 2025-05-07T20:32:31.2906703Z contiguous=True, 2025-05-07T20:32:31.2906933Z compiled=False, 2025-05-07T20:32:31.2907139Z ) 2025-05-07T20:32:31.2907332Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2907701Z self=, 2025-05-07T20:32:31.2908082Z T=128, 2025-05-07T20:32:31.2908265Z D=5120, 2025-05-07T20:32:31.2908461Z contiguous=False, 2025-05-07T20:32:31.2908685Z compiled=False, 2025-05-07T20:32:31.2908891Z ) 2025-05-07T20:32:31.2909090Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2909458Z self=, 2025-05-07T20:32:31.2909828Z T=1, 2025-05-07T20:32:31.2910014Z D=5120, 2025-05-07T20:32:31.2910211Z contiguous=True, 2025-05-07T20:32:31.2910435Z compiled=False, 2025-05-07T20:32:31.2910636Z ) 2025-05-07T20:32:31.2910837Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2911207Z self=, 2025-05-07T20:32:31.2911577Z T=2048, 2025-05-07T20:32:31.2911774Z D=7168, 2025-05-07T20:32:31.2911971Z contiguous=False, 2025-05-07T20:32:31.2912193Z compiled=True, 2025-05-07T20:32:31.2912399Z ) 2025-05-07T20:32:31.2912603Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2912971Z self=, 2025-05-07T20:32:31.2913350Z T=2048, 2025-05-07T20:32:31.2913540Z D=7168, 2025-05-07T20:32:31.2913735Z contiguous=False, 2025-05-07T20:32:31.2913962Z compiled=False, 2025-05-07T20:32:31.2914173Z ) 2025-05-07T20:32:31.2914372Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2914745Z self=, 2025-05-07T20:32:31.2915122Z T=16384, 2025-05-07T20:32:31.2915310Z D=7168, 2025-05-07T20:32:31.2915512Z contiguous=False, 2025-05-07T20:32:31.2915803Z compiled=True, 2025-05-07T20:32:31.2916003Z ) 2025-05-07T20:32:31.2916204Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2916575Z self=, 2025-05-07T20:32:31.2916952Z T=16384, 2025-05-07T20:32:31.2917140Z D=7168, 2025-05-07T20:32:31.2917335Z contiguous=True, 2025-05-07T20:32:31.2917567Z compiled=True, 2025-05-07T20:32:31.2917765Z ) 2025-05-07T20:32:31.2917964Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2918332Z self=, 2025-05-07T20:32:31.2918703Z T=4096, 2025-05-07T20:32:31.2918890Z D=7168, 2025-05-07T20:32:31.2919091Z contiguous=True, 2025-05-07T20:32:31.2919307Z compiled=True, 2025-05-07T20:32:31.2919511Z ) 2025-05-07T20:32:31.2919713Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2920075Z self=, 2025-05-07T20:32:31.2920451Z T=2048, 2025-05-07T20:32:31.2920639Z D=5120, 2025-05-07T20:32:31.2920837Z contiguous=False, 2025-05-07T20:32:31.2921064Z compiled=False, 2025-05-07T20:32:31.2921277Z ) 2025-05-07T20:32:31.2921572Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2921944Z self=, 2025-05-07T20:32:31.2922368Z T=2048, 2025-05-07T20:32:31.2922674Z D=5120, 2025-05-07T20:32:31.2922863Z contiguous=True, 2025-05-07T20:32:31.2923087Z compiled=False, 2025-05-07T20:32:31.2923295Z ) 2025-05-07T20:32:31.2923488Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2923866Z self=, 2025-05-07T20:32:31.2924244Z T=128, 2025-05-07T20:32:31.2924427Z D=7168, 2025-05-07T20:32:31.2924628Z contiguous=False, 2025-05-07T20:32:31.2924855Z compiled=True, 2025-05-07T20:32:31.2925052Z ) 2025-05-07T20:32:31.2925253Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2925626Z self=, 2025-05-07T20:32:31.2925999Z T=16384, 2025-05-07T20:32:31.2926200Z D=5120, 2025-05-07T20:32:31.2926404Z contiguous=True, 2025-05-07T20:32:31.2926622Z compiled=True, 2025-05-07T20:32:31.2926828Z ) 2025-05-07T20:32:31.2927033Z Trying example: 
test_silu_mul( 2025-05-07T20:32:31.2927397Z self=, 2025-05-07T20:32:31.2927779Z T=2048, 2025-05-07T20:32:31.2927972Z D=5120, 2025-05-07T20:32:31.2928166Z contiguous=False, 2025-05-07T20:32:31.2928394Z compiled=True, 2025-05-07T20:32:31.2928597Z ) 2025-05-07T20:32:31.2928794Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2929165Z self=, 2025-05-07T20:32:31.2929542Z T=16384, 2025-05-07T20:32:31.2929743Z D=5120, 2025-05-07T20:32:31.2929936Z contiguous=True, 2025-05-07T20:32:31.2930160Z compiled=False, 2025-05-07T20:32:31.2930369Z ) 2025-05-07T20:32:31.2930560Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2930928Z self=, 2025-05-07T20:32:31.2931310Z T=16384, 2025-05-07T20:32:31.2931501Z D=7168, 2025-05-07T20:32:31.2931702Z contiguous=False, 2025-05-07T20:32:31.2931929Z compiled=False, 2025-05-07T20:32:31.2932130Z ) 2025-05-07T20:32:31.2932341Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2932709Z self=, 2025-05-07T20:32:31.2933080Z T=16384, 2025-05-07T20:32:31.2933279Z D=7168, 2025-05-07T20:32:31.2933484Z contiguous=True, 2025-05-07T20:32:31.2933702Z compiled=False, 2025-05-07T20:32:31.2933910Z ) 2025-05-07T20:32:31.2934092Z PASSED 2025-05-07T20:32:31.3518374Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:31.3519676Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:32:31.3521043Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:31.3522502Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:31.3523483Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:31.3524784Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:31.3526498Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.3527491Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:31.3528884Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:31.3530260Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.3531327Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
2025-05-07T20:32:31.3532614Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:31.3533861Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:32:31.3535077Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:31.3536283Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:32:31.3537108Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:31.3538133Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:31.3539148Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:32:31.3539940Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^ 2025-05-07T20:32:31.3541144Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:31.3542427Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:31.3543542Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:31.3544581Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:32:31.3545770Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:31.3547133Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:31.3548199Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.3549196Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.3549937Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:32:31.3551033Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.8313503Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.8314418Z self=, 2025-05-07T20:32:31.8314867Z T=1, 2025-05-07T20:32:31.8315064Z D=5120, 2025-05-07T20:32:31.8315259Z scale_ub=None, 2025-05-07T20:32:31.8315481Z contiguous=True, 2025-05-07T20:32:31.8315796Z compiled=True, 2025-05-07T20:32:31.8316006Z ) 2025-05-07T20:32:31.8316328Z self = 2025-05-07T20:32:31.8316818Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:31.8317078Z 2025-05-07T20:32:31.8317161Z @given( 2025-05-07T20:32:31.8317403Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.8317720Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.8318022Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.8318374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.8318707Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.8319002Z ) 2025-05-07T20:32:31.8319358Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.8319806Z def test_silu_mul_quant( 2025-05-07T20:32:31.8320057Z self, 2025-05-07T20:32:31.8320253Z T: int, 2025-05-07T20:32:31.8320464Z D: int, 2025-05-07T20:32:31.8320700Z scale_ub: Optional[float], 2025-05-07T20:32:31.8320976Z contiguous: bool, 2025-05-07T20:32:31.8321229Z compiled: bool, 2025-05-07T20:32:31.8321467Z ) -> None: 2025-05-07T20:32:31.8321689Z torch.manual_seed(2025) 2025-05-07T20:32:31.8321948Z 2025-05-07T20:32:31.8322238Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.8322584Z 2025-05-07T20:32:31.8322791Z x_sign = torch.sign(x) 2025-05-07T20:32:31.8323422Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.8323751Z x = x_sign * x_clamp 2025-05-07T20:32:31.8323995Z x0 = x[:, :D] 2025-05-07T20:32:31.8324221Z x1 = x[:, D:] 2025-05-07T20:32:31.8324594Z 2025-05-07T20:32:31.8324780Z if contiguous: 2025-05-07T20:32:31.8325022Z x0 = x0.contiguous() 2025-05-07T20:32:31.8325289Z x1 = x1.contiguous() 2025-05-07T20:32:31.8325525Z 2025-05-07T20:32:31.8325724Z if scale_ub is not None: 2025-05-07T20:32:31.8326004Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.8326341Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.8326659Z ) 2025-05-07T20:32:31.8326865Z else: 2025-05-07T20:32:31.8327074Z scale_ub_tensor = None 2025-05-07T20:32:31.8327332Z 2025-05-07T20:32:31.8327569Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.8327887Z op = silu_mul_quant 2025-05-07T20:32:31.8328152Z if compiled: 2025-05-07T20:32:31.8328406Z op = torch.compile(op) 2025-05-07T20:32:31.8328707Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.8328989Z 2025-05-07T20:32:31.8329186Z y_fp8, y_scale = fn() 2025-05-07T20:32:31.8329474Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:31.8329761Z 2025-05-07T20:32:31.8330002Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.8330341Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:31.8330629Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:31.8330945Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:31.8331309Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:31.8331619Z 2025-05-07T20:32:31.8331828Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:31.8332023Z 2025-05-07T20:32:31.8332133Z moe/activation_test.py:126: 2025-05-07T20:32:31.8332448Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.8332823Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:31.8333161Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:31.8333960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:31.8334723Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:31.8335272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.8335956Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.8336650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:31.8337377Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:31.8338118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:31.8338766Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:31.8339381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:31.8339903Z fn() 2025-05-07T20:32:31.8340409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:31.8340994Z self.fn.run( 2025-05-07T20:32:31.8341467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.8342001Z kernel = self.compile( 2025-05-07T20:32:31.8342548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.8343300Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.8343707Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.8343939Z 2025-05-07T20:32:31.8344150Z self = 2025-05-07T20:32:31.8345315Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.8346808Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c04acc360>} 2025-05-07T20:32:31.8348155Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.8349186Z context = 2025-05-07T20:32:31.8349474Z 2025-05-07T20:32:31.8349641Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.8350164Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.8350643Z module_map=module_map) 2025-05-07T20:32:31.8351013Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.8351367Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:31.8351638Z E ^ 2025-05-07T20:32:31.8352105Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.8352557Z 2025-05-07T20:32:31.8352979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.8353489Z 2025-05-07T20:32:31.8353594Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.8354016Z self=, 2025-05-07T20:32:31.8354421Z T=2048, 2025-05-07T20:32:31.8354609Z D=5120, 2025-05-07T20:32:31.8354811Z scale_ub=1200.0, 2025-05-07T20:32:31.8355041Z contiguous=True, 2025-05-07T20:32:31.8355261Z compiled=False, 2025-05-07T20:32:31.8355469Z ) 2025-05-07T20:32:32.1222760Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:32.1224023Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:32:32.1225372Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:32.1226839Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:32.1227844Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:32.1229154Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:32.1230534Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.1231827Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:32.1233067Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:32.1234603Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.1235767Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:32.1237056Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:32.1238308Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:32:32.1239534Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:32.1240755Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:32.1241589Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:32.1242609Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:32.1243681Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:32:32.1244478Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^ 2025-05-07T20:32:32.1245696Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:32.1246984Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:32.1248094Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:32.1249139Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:32:32.1250318Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:32.1251687Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:32.1252768Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.1253701Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.1254446Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:32:32.1255553Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7858292Z self = 2025-05-07T20:32:32.7858925Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.7859324Z 2025-05-07T20:32:32.7859477Z @given( 2025-05-07T20:32:32.7859717Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7860038Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7860362Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7860699Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7861020Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7861311Z ) 2025-05-07T20:32:32.7861661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7862105Z def test_silu_mul_quant( 2025-05-07T20:32:32.7862354Z self, 2025-05-07T20:32:32.7862555Z T: int, 2025-05-07T20:32:32.7862749Z D: int, 2025-05-07T20:32:32.7862971Z scale_ub: Optional[float], 2025-05-07T20:32:32.7863249Z contiguous: bool, 2025-05-07T20:32:32.7863483Z compiled: bool, 2025-05-07T20:32:32.7863717Z ) -> None: 2025-05-07T20:32:32.7863945Z torch.manual_seed(2025) 2025-05-07T20:32:32.7864183Z 2025-05-07T20:32:32.7864463Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7864818Z 2025-05-07T20:32:32.7865026Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7865312Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7865883Z x = x_sign * x_clamp 2025-05-07T20:32:32.7866127Z x0 = x[:, :D] 2025-05-07T20:32:32.7866343Z x1 = x[:, D:] 2025-05-07T20:32:32.7866556Z 2025-05-07T20:32:32.7866749Z if contiguous: 2025-05-07T20:32:32.7866978Z x0 = x0.contiguous() 2025-05-07T20:32:32.7867248Z x1 = x1.contiguous() 2025-05-07T20:32:32.7867497Z 2025-05-07T20:32:32.7867691Z if scale_ub is not None: 2025-05-07T20:32:32.7867975Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7868319Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7868628Z ) 2025-05-07T20:32:32.7869161Z else: 2025-05-07T20:32:32.7869385Z scale_ub_tensor = None 2025-05-07T20:32:32.7869639Z 2025-05-07T20:32:32.7869884Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7870405Z op = silu_mul_quant 2025-05-07T20:32:32.7870664Z if compiled: 2025-05-07T20:32:32.7870916Z op = torch.compile(op) 2025-05-07T20:32:32.7871216Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7871496Z 2025-05-07T20:32:32.7871689Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7871859Z 2025-05-07T20:32:32.7871962Z moe/activation_test.py:117: 2025-05-07T20:32:32.7872265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7872599Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7872889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7873591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7874282Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7874812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7875499Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7876261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7876788Z kernel = self.compile( 2025-05-07T20:32:32.7877326Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7877983Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7878382Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7878614Z 2025-05-07T20:32:32.7878824Z self = 2025-05-07T20:32:32.7879907Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7881301Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c054f1e40>} 2025-05-07T20:32:32.7882643Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7883722Z context = 2025-05-07T20:32:32.7884009Z 2025-05-07T20:32:32.7884176Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7884703Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7885195Z module_map=module_map) 2025-05-07T20:32:32.7885567Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7885922Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7886198Z E ^ 2025-05-07T20:32:32.7886665Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7887116Z 2025-05-07T20:32:32.7887539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7888047Z 2025-05-07T20:32:32.7888154Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7888576Z self=, 2025-05-07T20:32:32.7888995Z T=2048, 2025-05-07T20:32:32.7889187Z D=5120, 2025-05-07T20:32:32.7889394Z scale_ub=1200.0, 2025-05-07T20:32:32.7889717Z contiguous=True, 2025-05-07T20:32:32.7889954Z compiled=True, 2025-05-07T20:32:32.7890166Z ) 2025-05-07T20:32:32.7890489Z self = 2025-05-07T20:32:32.7891070Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:32.7891340Z 2025-05-07T20:32:32.7891422Z @given( 2025-05-07T20:32:32.7891664Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7891981Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7892286Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7892668Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7901409Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7901749Z ) 2025-05-07T20:32:32.7902113Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7902567Z def test_silu_mul_quant( 2025-05-07T20:32:32.7902830Z self, 2025-05-07T20:32:32.7903035Z T: int, 2025-05-07T20:32:32.7903270Z D: int, 2025-05-07T20:32:32.7903490Z scale_ub: Optional[float], 2025-05-07T20:32:32.7903772Z contiguous: bool, 2025-05-07T20:32:32.7904031Z compiled: bool, 2025-05-07T20:32:32.7904257Z ) -> None: 2025-05-07T20:32:32.7904486Z torch.manual_seed(2025) 2025-05-07T20:32:32.7904743Z 2025-05-07T20:32:32.7905024Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7905386Z 2025-05-07T20:32:32.7905592Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7905891Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7906203Z x = x_sign * x_clamp 2025-05-07T20:32:32.7906454Z x0 = x[:, :D] 
2025-05-07T20:32:32.7906682Z x1 = x[:, D:] 2025-05-07T20:32:32.7906890Z 2025-05-07T20:32:32.7907091Z if contiguous: 2025-05-07T20:32:32.7907334Z x0 = x0.contiguous() 2025-05-07T20:32:32.7907602Z x1 = x1.contiguous() 2025-05-07T20:32:32.7907862Z 2025-05-07T20:32:32.7908060Z if scale_ub is not None: 2025-05-07T20:32:32.7908337Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7908691Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7909014Z ) 2025-05-07T20:32:32.7909209Z else: 2025-05-07T20:32:32.7909430Z scale_ub_tensor = None 2025-05-07T20:32:32.7909689Z 2025-05-07T20:32:32.7909924Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7910254Z op = silu_mul_quant 2025-05-07T20:32:32.7910521Z if compiled: 2025-05-07T20:32:32.7910785Z op = torch.compile(op) 2025-05-07T20:32:32.7911087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7911378Z 2025-05-07T20:32:32.7911583Z y_fp8, y_scale = fn() 2025-05-07T20:32:32.7911873Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:32.7912180Z 2025-05-07T20:32:32.7912429Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7912766Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:32.7913073Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:32.7913403Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:32.7913765Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7914089Z 2025-05-07T20:32:32.7914305Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:32.7914503Z 2025-05-07T20:32:32.7914618Z moe/activation_test.py:126: 2025-05-07T20:32:32.7914917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7915264Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:32.7915599Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7916572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:32.7917338Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:32.7917893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7918663Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7919355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:32.7920084Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.7920823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:32.7921469Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:32.7922074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:32.7922612Z fn() 2025-05-07T20:32:32.7923131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:32.7923716Z self.fn.run( 2025-05-07T20:32:32.7924201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7924746Z kernel = self.compile( 2025-05-07T20:32:32.7925286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7925951Z 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7926370Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7926602Z 2025-05-07T20:32:32.7926820Z self = 2025-05-07T20:32:32.7927913Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7929301Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c0535ac00>} 2025-05-07T20:32:32.7930663Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7931694Z context = 2025-05-07T20:32:32.7931988Z 2025-05-07T20:32:32.7932165Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7932691Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7933197Z module_map=module_map) 2025-05-07T20:32:32.7933603Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7933969Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:32.7934252Z E ^ 2025-05-07T20:32:32.7934729Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7935185Z 2025-05-07T20:32:32.7935610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7936123Z 2025-05-07T20:32:32.7936229Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7936647Z self=, 2025-05-07T20:32:32.7937059Z T=16384, 2025-05-07T20:32:32.7937254Z D=7168, 2025-05-07T20:32:32.7937460Z scale_ub=1200.0, 2025-05-07T20:32:32.7937690Z contiguous=False, 2025-05-07T20:32:32.7937927Z compiled=False, 2025-05-07T20:32:32.7938137Z ) 2025-05-07T20:32:32.9755599Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:32.9757827Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:32:32.9760748Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:32.9763295Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:32.9764272Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:32.9765836Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:32.9767240Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.9768232Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:32.9769468Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:32.9770854Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.9771926Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:32.9773220Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:32.9774473Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:32:32.9775709Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:32.9776914Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:32:32.9777752Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:32.9778785Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:32.9779808Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:32:32.9780600Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:32:32.9781968Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:32.9783258Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:32.9784489Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:32.9785538Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:32:32.9786717Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:32.9788091Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:32.9789162Z 
W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.9790091Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.9790840Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:32:32.9791866Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:33.9737084Z self = ...
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = ...
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:33.9767245Z Trying example: test_silu_mul_quant(self=..., T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
self = ...
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    [... test source identical to the listing above up to fn(); this example continues past it:]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = ...
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:33.9814330Z Trying example: test_silu_mul_quant(self=..., T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... four identify_mutated_tensors warning tracebacks (W0507 20:32:34.267/.476/.771/.781, tag [1/3]) omitted; frames and the fp8e4nv ValueError are identical to the [1/2] traceback above ...]
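[editor's note] The W0507 identify_mutated_tensors warnings collapsed above are a downstream symptom, not a separate bug: when torch.compile encounters a user-defined Triton kernel it first generates TTIR to work out which arguments the kernel writes to, and when that generation raises (here, the same fp8e4nv ValueError), it conservatively assumes every input is mutated, logs the traceback, and carries on. The test failures themselves come from the kernels' own JIT compilation at launch time.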
[... the remaining Hypothesis examples repeat the same source listing and traceback verbatim; condensed to one line each ...]
2025-05-07T20:32:36.1366330Z [T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False fails at fn() (moe/activation_test.py:117); _fbgemm_silu_mul_quant: same CompilationError]
2025-05-07T20:32:36.1395850Z Trying example: test_silu_mul_quant(self=..., T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:36.1398047Z [fails at fn() (moe/activation_test.py:117); _fbgemm_silu_mul_quant: same CompilationError]
2025-05-07T20:32:36.1426966Z Trying example: test_silu_mul_quant(self=..., T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:36.2006129Z [fn() passes; fails at ref_fn() (moe/activation_test.py:126); _kernel_quantize_fp8_row: same CompilationError]
2025-05-07T20:32:36.2053890Z Trying example: test_silu_mul_quant(self=..., T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:36.4027582Z [fails at fn() (moe/activation_test.py:117); _fbgemm_silu_mul_quant: same CompilationError]
2025-05-07T20:32:36.4060590Z Trying example: test_silu_mul_quant(self=..., T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:36.4062959Z [fails at fn() (moe/activation_test.py:117); _fbgemm_silu_mul_quant: same CompilationError]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.4091373Z 2025-05-07T20:32:36.4091788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.4092299Z 2025-05-07T20:32:36.4092410Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.4092820Z self=, 2025-05-07T20:32:36.4093227Z T=1, 2025-05-07T20:32:36.4093425Z D=5120, 2025-05-07T20:32:36.4093623Z scale_ub=None, 2025-05-07T20:32:36.4093834Z contiguous=True, 2025-05-07T20:32:36.4094059Z compiled=True, 2025-05-07T20:32:36.4094268Z ) 2025-05-07T20:32:36.6439332Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:36.6440425Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:32:36.6441777Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:36.6443241Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:36.6444234Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.6445553Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:36.6446945Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.6447943Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.6449482Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:36.6450868Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.6452148Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.6453430Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:36.6454687Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:32:36.6455961Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:36.6457181Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:32:36.6458014Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.6459040Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:36.6460064Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:32:36.6460871Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^ 2025-05-07T20:32:36.6462086Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:36.6463372Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:36.6464494Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:36.6465866Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:32:36.6467066Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:36.6468432Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:36.6469656Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.6470575Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.6471410Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:32:36.6472692Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.7138663Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:36.7141269Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:32:36.7143967Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:36.7145871Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:36.7146869Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.7148186Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:36.7149577Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.7150577Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.7151816Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:36.7153208Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.7154294Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.7155628Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:36.7156961Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:32:36.7158196Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:36.7159414Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:32:36.7160257Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.7161280Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:36.7162310Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:32:36.7163114Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^ 2025-05-07T20:32:36.7164474Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:36.7166118Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:36.7167240Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:36.7168289Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:32:36.7169486Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:36.7170854Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:36.7171922Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.7172841Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.7173586Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:32:36.7174621Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.9172800Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:36.9173863Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:32:36.9175234Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:36.9177050Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:36.9178248Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.9179870Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:36.9181584Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.9182787Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.9184306Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:36.9186369Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.9187442Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.9188862Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:36.9190121Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:32:36.9191339Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:36.9192553Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:32:36.9193378Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.9194417Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:36.9195444Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:32:36.9196330Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^ 2025-05-07T20:32:36.9197536Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:36.9198812Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:36.9199942Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:36.9200981Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:32:36.9202165Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:36.9203521Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:36.9204583Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.9205556Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.9206297Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:32:36.9207316Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.9277953Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:36.9279179Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:32:36.9280525Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:36.9282068Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:36.9283059Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.9284374Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:36.9285768Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.9286769Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.9288007Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:36.9289389Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.9290469Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.9291758Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:36.9293013Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:32:36.9294240Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:36.9295460Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:32:36.9296293Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.9297324Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:36.9298360Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:32:36.9299163Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^ 2025-05-07T20:32:36.9300381Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:36.9301741Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:36.9302872Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:36.9304000Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:32:36.9305184Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:36.9306554Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:36.9307628Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.9308548Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.9309304Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:32:36.9310337Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1482816Z self = 2025-05-07T20:32:37.1483554Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:37.1483833Z 2025-05-07T20:32:37.1483919Z @given( 2025-05-07T20:32:37.1484167Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1484521Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1484833Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1485172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1485523Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1485813Z ) 2025-05-07T20:32:37.1486177Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1486629Z def test_silu_mul_quant( 2025-05-07T20:32:37.1486872Z self, 2025-05-07T20:32:37.1487078Z T: int, 2025-05-07T20:32:37.1487287Z D: int, 2025-05-07T20:32:37.1487512Z scale_ub: Optional[float], 2025-05-07T20:32:37.1487819Z contiguous: bool, 2025-05-07T20:32:37.1488072Z compiled: bool, 2025-05-07T20:32:37.1488307Z ) -> None: 2025-05-07T20:32:37.1488528Z torch.manual_seed(2025) 2025-05-07T20:32:37.1488782Z 2025-05-07T20:32:37.1489068Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1489430Z 2025-05-07T20:32:37.1489630Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1489934Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1490256Z x = x_sign * x_clamp 2025-05-07T20:32:37.1490511Z x0 = x[:, :D] 2025-05-07T20:32:37.1490740Z x1 = x[:, D:] 2025-05-07T20:32:37.1490961Z 2025-05-07T20:32:37.1491152Z if contiguous: 2025-05-07T20:32:37.1491390Z x0 = x0.contiguous() 2025-05-07T20:32:37.1499870Z x1 = x1.contiguous() 2025-05-07T20:32:37.1500148Z 2025-05-07T20:32:37.1500346Z if scale_ub is not None: 2025-05-07T20:32:37.1500623Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1500957Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1501299Z ) 2025-05-07T20:32:37.1501495Z else: 2025-05-07T20:32:37.1501710Z scale_ub_tensor = None 2025-05-07T20:32:37.1501976Z 2025-05-07T20:32:37.1502557Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1502875Z op = silu_mul_quant 2025-05-07T20:32:37.1503118Z if compiled: 2025-05-07T20:32:37.1503362Z op = torch.compile(op) 2025-05-07T20:32:37.1503812Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1504088Z 2025-05-07T20:32:37.1504295Z y_fp8, y_scale = fn() 2025-05-07T20:32:37.1504585Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:37.1504887Z 2025-05-07T20:32:37.1505133Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1505478Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:37.1505782Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:37.1506110Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:37.1506479Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.1506794Z 2025-05-07T20:32:37.1507013Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:37.1507211Z 2025-05-07T20:32:37.1507325Z moe/activation_test.py:126: 2025-05-07T20:32:37.1507626Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1507984Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:37.1508322Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.1509113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 
2025-05-07T20:32:37.1509874Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:37.1510426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1511118Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1511812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:37.1512552Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.1513300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:37.1513955Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:37.1514561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:37.1515090Z fn() 2025-05-07T20:32:37.1515608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:37.1516314Z self.fn.run( 2025-05-07T20:32:37.1516788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1517332Z kernel = self.compile( 2025-05-07T20:32:37.1517881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1518532Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1518942Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1519178Z 2025-05-07T20:32:37.1519394Z self = 2025-05-07T20:32:37.1520485Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1521879Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c05058c20>} 2025-05-07T20:32:37.1523321Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1524360Z context = 2025-05-07T20:32:37.1524651Z 2025-05-07T20:32:37.1524905Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1525427Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1525908Z module_map=module_map) 2025-05-07T20:32:37.1526277Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1526642Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:37.1526911Z E ^ 2025-05-07T20:32:37.1527382Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1527835Z 2025-05-07T20:32:37.1528259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1528779Z 2025-05-07T20:32:37.1528891Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1529313Z self=, 2025-05-07T20:32:37.1529734Z T=2048, 2025-05-07T20:32:37.1529932Z D=5120, 2025-05-07T20:32:37.1530132Z scale_ub=None, 2025-05-07T20:32:37.1530360Z contiguous=True, 2025-05-07T20:32:37.1530593Z compiled=True, 2025-05-07T20:32:37.1530802Z ) 2025-05-07T20:32:37.3786614Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:37.3787885Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:37.3789269Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:37.3790723Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:37.3791721Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.3793042Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:37.3794439Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.3795444Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.3796797Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:37.3798187Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.3799271Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.3800904Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:37.3802173Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:37.3803553Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:37.3804768Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:37.3805615Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.3806665Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:37.3807694Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:32:37.3808512Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:32:37.3809727Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:37.3811024Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:37.3812169Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:37.3813224Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:37.3814412Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:37.3815848Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:37.3816927Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.3817854Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.3818614Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:37.3819641Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.4480335Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:37.4481650Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:37.4482997Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:37.4484816Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:37.4485856Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.4487347Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:37.4488739Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4489746Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.4490989Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:37.4492385Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.4493459Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.4494754Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:37.4496012Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:37.4497243Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:37.4498473Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:37.4499311Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.4500346Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:37.4501382Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:32:37.4502188Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:32:37.4503405Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:37.4504698Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:37.4505828Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:37.4506965Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:37.4508164Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:37.4509605Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:37.4510688Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.4511613Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.4512371Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:37.4513401Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.6514484Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:37.6516426Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:37.6517767Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:37.6519213Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:37.6520204Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.6521513Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:37.6522901Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.6523892Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.6525129Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:37.6526507Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.6527574Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.6528858Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:37.6530113Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:37.6531658Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:37.6533019Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:37.6533842Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.6534871Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:37.6535945Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:32:37.6536752Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:32:37.6537965Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:37.6539247Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:37.6540368Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:37.6541418Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:37.6542607Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:37.6543970Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:37.6545034Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.6545996Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.6546744Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:37.6547770Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.6614366Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:37.6615612Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:37.6616992Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:37.6618412Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:37.6619577Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.6620884Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:37.6622366Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.6623355Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.6624580Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:37.6626005Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.6627079Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.6628352Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:37.6629595Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:37.6630822Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:37.6632032Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:37.6632865Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.6633882Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:37.6634906Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:32:37.6635841Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:32:37.6637064Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:37.6638343Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:37.6639459Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:37.6640500Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:37.6641676Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:37.6643149Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:37.6644278Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.6645186Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.6645924Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:37.6646942Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

self =
T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea1553a0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.8742256Z 2025-05-07T20:32:37.8742675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.8743262Z 2025-05-07T20:32:37.8743372Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.8743784Z self=, 2025-05-07T20:32:37.8744189Z T=128, 2025-05-07T20:32:37.8744382Z D=5120, 2025-05-07T20:32:37.8744575Z scale_ub=None, 2025-05-07T20:32:37.8744800Z contiguous=True, 2025-05-07T20:32:37.8745028Z compiled=True, 2025-05-07T20:32:37.8745235Z ) 2025-05-07T20:32:38.1087121Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:38.1088237Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:38.1089586Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:38.1091028Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:38.1092010Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:38.1093328Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:38.1094704Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.1095696Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:38.1096983Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:38.1098359Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.1099427Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:38.1100708Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:38.1101958Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:38.1103173Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:38.1104373Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:38.1105527Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:38.1106550Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:38.1107775Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:32:38.1108567Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^ 2025-05-07T20:32:38.1109772Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:38.1111061Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:38.1112167Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:38.1113215Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:38.1114392Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:38.1115842Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:38.1116909Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.1117825Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.1118582Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:38.1119612Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
self =
T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be93a1ee0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.6455082Z 2025-05-07T20:32:38.6455504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.6456016Z 2025-05-07T20:32:38.6456130Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.6456548Z self=, 2025-05-07T20:32:38.6456959Z T=4096, 2025-05-07T20:32:38.6457162Z D=5120, 2025-05-07T20:32:38.6457364Z scale_ub=None, 2025-05-07T20:32:38.6457590Z contiguous=True, 2025-05-07T20:32:38.6457820Z compiled=True, 2025-05-07T20:32:38.6458032Z ) 2025-05-07T20:32:38.8841195Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:38.8842436Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:32:38.8843786Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:38.8845229Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:38.8846216Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:38.8847529Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:38.8848922Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.8849911Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:38.8851142Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:38.8852527Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.8853617Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:38.8854907Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:38.8856171Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:32:38.8857402Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:38.8858620Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:32:38.8859456Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:38.8860488Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:38.8861513Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:32:38.8862318Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:38.8863615Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:38.8864910Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:38.8866375Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:38.8867429Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:32:38.8868621Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:38.8869987Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:38.8871066Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.8871994Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.8872740Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:32:38.8873765Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.4257343Z self = 2025-05-07T20:32:39.4258114Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:39.4258456Z 2025-05-07T20:32:39.4258543Z @given( 2025-05-07T20:32:39.4258782Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.4259113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.4259429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.4259765Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.4260457Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.4260748Z ) 2025-05-07T20:32:39.4261118Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.4261748Z def test_silu_mul_quant( 2025-05-07T20:32:39.4262001Z self, 2025-05-07T20:32:39.4262195Z T: int, 2025-05-07T20:32:39.4262395Z D: int, 2025-05-07T20:32:39.4262622Z scale_ub: Optional[float], 2025-05-07T20:32:39.4262898Z contiguous: bool, 2025-05-07T20:32:39.4263144Z compiled: bool, 2025-05-07T20:32:39.4263375Z ) -> None: 2025-05-07T20:32:39.4263593Z torch.manual_seed(2025) 2025-05-07T20:32:39.4263849Z 2025-05-07T20:32:39.4264130Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.4264478Z 2025-05-07T20:32:39.4264681Z x_sign = torch.sign(x) 2025-05-07T20:32:39.4264984Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.4265300Z x = x_sign * x_clamp 2025-05-07T20:32:39.4265921Z x0 = x[:, :D] 2025-05-07T20:32:39.4266144Z x1 = x[:, D:] 2025-05-07T20:32:39.4266352Z 2025-05-07T20:32:39.4266552Z if contiguous: 2025-05-07T20:32:39.4266784Z x0 = x0.contiguous() 2025-05-07T20:32:39.4267044Z x1 = x1.contiguous() 2025-05-07T20:32:39.4267286Z 2025-05-07T20:32:39.4267483Z if scale_ub is not None: 2025-05-07T20:32:39.4267762Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.4268099Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.4268415Z ) 2025-05-07T20:32:39.4268608Z else: 2025-05-07T20:32:39.4268828Z scale_ub_tensor = None 2025-05-07T20:32:39.4269092Z 2025-05-07T20:32:39.4269333Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.4269646Z op = silu_mul_quant 2025-05-07T20:32:39.4269909Z if compiled: 2025-05-07T20:32:39.4270165Z op = torch.compile(op) 2025-05-07T20:32:39.4270458Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.4270738Z 2025-05-07T20:32:39.4270937Z y_fp8, y_scale = fn() 2025-05-07T20:32:39.4271225Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:39.4271520Z 2025-05-07T20:32:39.4271766Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.4272099Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:39.4272399Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:39.4272719Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:39.4273078Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:39.4273398Z 2025-05-07T20:32:39.4273604Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:39.4273799Z 2025-05-07T20:32:39.4273910Z moe/activation_test.py:126: 2025-05-07T20:32:39.4274213Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.4274561Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:39.4274898Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:39.4275771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:32:39.4276534Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:39.4277088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.4277776Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.4278467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:39.4279200Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:39.4280075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:39.4280723Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:39.4281326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:39.4281956Z fn() 2025-05-07T20:32:39.4282465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:39.4283043Z self.fn.run( 2025-05-07T20:32:39.4283515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.4284049Z kernel = self.compile( 2025-05-07T20:32:39.4284591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.4285239Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.4285647Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.4285878Z 2025-05-07T20:32:39.4286093Z self = 2025-05-07T20:32:39.4287181Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.4288569Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be9722660>} 2025-05-07T20:32:39.4289911Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.4290938Z context = 2025-05-07T20:32:39.4291232Z 2025-05-07T20:32:39.4291407Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.4291933Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.4292422Z module_map=module_map) 2025-05-07T20:32:39.4292791Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.4293156Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:39.4293425Z E ^ 2025-05-07T20:32:39.4293891Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:39.4294772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:39.4295388Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True, )
2025-05-07T20:32:39.4590668Z W0507 20:32:39.457000 95431 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:39.4591917Z W0507 20:32:39.457000 95431 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:39.4593270Z W0507 20:32:39.457000 95431 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:39.4594583Z W0507 20:32:39.457000 95431 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:39.4595791Z W0507 20:32:39.457000 95431 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
[The full test_silu_mul_quant source listing and the identical fp8e4nv CompilationError traceback repeat verbatim for every hypothesis example below; the repeats are condensed to one summary line per example, keeping only what varies: the drawn parameters and the first kernel to fail.]
T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True: fails in ref_fn() (moe/activation_test.py:126) -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row: CompilationError, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:39.5513884Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True, )
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True: fails in fn() (moe/activation_test.py:117) -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant: same CompilationError
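The ValueError itself points at the root cause: Triton only exposes fp8e4nv (torch.float8_e4m3fn) on GPUs with native FP8 tensor cores, i.e. compute capability sm_89 (Ada) or sm_90 (Hopper) and newer. This job runs on linux.g5.4xlarge.nvidia.gpu, whose A10G is sm_86, so only the fp8e4b15/fp8e5 encodings are available and every kernel that materializes an e4m3 output fails to compile. A minimal sketch of a capability gate that a test module like this could use to skip cleanly on pre-Ada runners (the helper name and the skip placement are illustrative assumptions, not FBGEMM's actual gating):

    import unittest

    import torch

    def has_native_fp8() -> bool:
        # fp8e4nv maps to torch.float8_e4m3fn, which needs sm_89+ hardware.
        # The A10G on g5.4xlarge reports (8, 6), so this returns False there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical guard; FBGEMM may gate its FP8 coverage differently.
    @unittest.skipUnless(has_native_fp8(), "fp8e4nv requires sm_89+ (Ada/Hopper)")
    class Fp8ActivationTests(unittest.TestCase):
        ...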
2025-05-07T20:32:39.6949411Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True, )
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True: fails in ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row: same CompilationError
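The recompile_limit warning above is a separate symptom worth noting: each hypothesis draw hands silu_mul_quant inputs with different strides (a slice of the [T, 2*D] buffer keeps row stride 2*D = 10240, while .contiguous() copies it down to D = 5120, exactly the "expected 5120, actual 10240" in the guard message), and torch.compile specializes on strides until it hits the limit of 8 and falls back to eager. A small sketch of the effect, using a dense stand-in for the real kernel so it runs anywhere:

    import torch

    def silu_mul(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Stand-in for silu_mul_quant's dense part; illustration only.
        return x0 * torch.sigmoid(x0) * x1

    compiled = torch.compile(silu_mul)

    D = 5120
    x = torch.randn(128, 2 * D)
    x0_view, x1_view = x[:, :D], x[:, D:]                      # row stride 10240
    x0_c, x1_c = x0_view.contiguous(), x1_view.contiguous()    # row stride 5120

    compiled(x0_view, x1_view)  # first graph, guards on stride 10240
    compiled(x0_c, x1_c)        # stride guard fails -> recompile
    # Run with TORCH_LOGS="recompiles" to print each recompilation reason.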
2025-05-07T20:32:39.7602930Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False, )
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False: fails in fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant: same CompilationError
2025-05-07T20:32:39.9142302Z Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True, )
T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True: fails in fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant: same CompilationError
2025-05-07T20:32:39.9183549Z Trying example: test_silu_mul_quant( self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False, )
T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False: fails in fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant: same CompilationError
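hypothesis keeps drawing further examples after the first failure, which is why the same traceback is replayed for every parameter combination above. During triage it is usually faster to pin one failing draw deterministically; a sketch using hypothesis's @example decorator, with the parameters taken from one of the failing draws above:

    from typing import Optional

    from hypothesis import example, given, settings, strategies as st

    @example(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(max_examples=1, deadline=None)
    def test_silu_mul_quant(
        T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool
    ) -> None:
        ...  # body as in the listing above

Explicit @example cases run before any generated ones, so the pinned combination reproduces first on every run.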
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.0342362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.0343054Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.0343729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.0344272Z kernel = self.compile( 2025-05-07T20:32:40.0344824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.0345487Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.0345888Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.0346126Z 2025-05-07T20:32:40.0346336Z self = 2025-05-07T20:32:40.0347474Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.0348978Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be8660e00>} 2025-05-07T20:32:40.0350320Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.0351428Z context = 2025-05-07T20:32:40.0351722Z 2025-05-07T20:32:40.0351890Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.0352416Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.0352888Z module_map=module_map) 2025-05-07T20:32:40.0353256Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.0353617Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.0353884Z E ^ 2025-05-07T20:32:40.0354352Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.0354813Z 2025-05-07T20:32:40.0355227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.0355837Z 2025-05-07T20:32:40.0355952Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.0356372Z self=, 2025-05-07T20:32:40.0356783Z T=128, 2025-05-07T20:32:40.0356984Z D=5120, 2025-05-07T20:32:40.0357191Z scale_ub=None, 2025-05-07T20:32:40.0357412Z contiguous=False, 2025-05-07T20:32:40.0357646Z compiled=False, 2025-05-07T20:32:40.0357863Z ) 2025-05-07T20:32:40.0358186Z self = 2025-05-07T20:32:40.0358687Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:40.0358959Z 2025-05-07T20:32:40.0359050Z @given( 2025-05-07T20:32:40.0359293Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.0359618Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.0359935Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.0360281Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.0360616Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.0360911Z ) 2025-05-07T20:32:40.0361268Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.0361715Z def test_silu_mul_quant( 2025-05-07T20:32:40.0361969Z self, 2025-05-07T20:32:40.0362174Z T: int, 2025-05-07T20:32:40.0362376Z D: int, 2025-05-07T20:32:40.0362606Z scale_ub: Optional[float], 2025-05-07T20:32:40.0362885Z contiguous: bool, 2025-05-07T20:32:40.0363128Z compiled: bool, 2025-05-07T20:32:40.0363359Z ) -> None: 2025-05-07T20:32:40.0363583Z torch.manual_seed(2025) 2025-05-07T20:32:40.0363825Z 2025-05-07T20:32:40.0364118Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.0364476Z 2025-05-07T20:32:40.0364676Z x_sign = torch.sign(x) 2025-05-07T20:32:40.0364983Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.0365310Z x = x_sign * x_clamp 2025-05-07T20:32:40.0369588Z x0 = x[:, :D] 2025-05-07T20:32:40.0369816Z x1 = x[:, D:] 2025-05-07T20:32:40.0370035Z 2025-05-07T20:32:40.0370226Z if contiguous: 2025-05-07T20:32:40.0370464Z x0 = x0.contiguous() 2025-05-07T20:32:40.0370726Z x1 = x1.contiguous() 2025-05-07T20:32:40.0370970Z 2025-05-07T20:32:40.0371164Z if scale_ub is not None: 2025-05-07T20:32:40.0371446Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.0371787Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.0372099Z ) 2025-05-07T20:32:40.0372297Z else: 2025-05-07T20:32:40.0372679Z scale_ub_tensor = None 2025-05-07T20:32:40.0372936Z 2025-05-07T20:32:40.0373175Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.0373497Z op = silu_mul_quant 2025-05-07T20:32:40.0373860Z if compiled: 2025-05-07T20:32:40.0374113Z op = torch.compile(op) 2025-05-07T20:32:40.0374413Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.0374694Z 2025-05-07T20:32:40.0374890Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.0375062Z 2025-05-07T20:32:40.0375166Z moe/activation_test.py:117: 2025-05-07T20:32:40.0375466Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.0375798Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.0376083Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.0376777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.0377518Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.0378059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.0378741Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.0379413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.0379946Z kernel = self.compile( 2025-05-07T20:32:40.0380496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.0381159Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.0381568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.0381801Z 2025-05-07T20:32:40.0382009Z self = 2025-05-07T20:32:40.0383098Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.0384480Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be8662980>} 2025-05-07T20:32:40.0385829Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.0386862Z context = 2025-05-07T20:32:40.0387166Z 2025-05-07T20:32:40.0387335Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.0387866Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.0388354Z module_map=module_map) 2025-05-07T20:32:40.0388722Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.0389091Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.0389374Z E ^ 2025-05-07T20:32:40.0389840Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.0390304Z 2025-05-07T20:32:40.0390721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.0391238Z 2025-05-07T20:32:40.0391347Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.0391779Z self=, 2025-05-07T20:32:40.0392184Z T=128, 2025-05-07T20:32:40.0392387Z D=5120, 2025-05-07T20:32:40.0392592Z scale_ub=1200.0, 2025-05-07T20:32:40.0392819Z contiguous=True, 2025-05-07T20:32:40.0393051Z compiled=False, 2025-05-07T20:32:40.0393774Z ) 2025-05-07T20:32:40.4218367Z self = 2025-05-07T20:32:40.4219002Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:40.4219668Z 2025-05-07T20:32:40.4219752Z @given( 2025-05-07T20:32:40.4219999Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.4220317Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.4220637Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.4220977Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.4221301Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.4221592Z ) 2025-05-07T20:32:40.4221951Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.4222398Z def test_silu_mul_quant( 2025-05-07T20:32:40.4222662Z self, 2025-05-07T20:32:40.4222874Z T: int, 2025-05-07T20:32:40.4223090Z D: int, 2025-05-07T20:32:40.4223318Z scale_ub: Optional[float], 2025-05-07T20:32:40.4223597Z contiguous: bool, 2025-05-07T20:32:40.4223847Z compiled: bool, 2025-05-07T20:32:40.4224079Z ) -> None: 2025-05-07T20:32:40.4224305Z torch.manual_seed(2025) 2025-05-07T20:32:40.4224560Z 2025-05-07T20:32:40.4224837Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.4225193Z 2025-05-07T20:32:40.4225394Z x_sign = torch.sign(x) 2025-05-07T20:32:40.4225685Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.4226009Z x = x_sign * x_clamp 2025-05-07T20:32:40.4226259Z x0 = x[:, :D] 2025-05-07T20:32:40.4226477Z x1 = x[:, D:] 2025-05-07T20:32:40.4226693Z 2025-05-07T20:32:40.4226890Z if contiguous: 2025-05-07T20:32:40.4227126Z x0 = x0.contiguous() 2025-05-07T20:32:40.4227392Z x1 = x1.contiguous() 2025-05-07T20:32:40.4227642Z 2025-05-07T20:32:40.4227843Z if scale_ub is not None: 2025-05-07T20:32:40.4228121Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.4228468Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.4228794Z ) 2025-05-07T20:32:40.4228987Z else: 2025-05-07T20:32:40.4229204Z scale_ub_tensor = None 2025-05-07T20:32:40.4229461Z 2025-05-07T20:32:40.4229692Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.4230011Z op = silu_mul_quant 2025-05-07T20:32:40.4230265Z if compiled: 2025-05-07T20:32:40.4230509Z op = torch.compile(op) 2025-05-07T20:32:40.4230806Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.4231085Z 2025-05-07T20:32:40.4231277Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.4231447Z 2025-05-07T20:32:40.4231550Z moe/activation_test.py:117: 2025-05-07T20:32:40.4231852Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.4232199Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.4232488Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.4233188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.4233885Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.4234418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.4235105Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.4235878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.4236412Z kernel = self.compile( 2025-05-07T20:32:40.4236954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.4237774Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.4238180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.4238411Z 2025-05-07T20:32:40.4238618Z self = 2025-05-07T20:32:40.4239780Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.4241271Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be86602c0>} 2025-05-07T20:32:40.4242620Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.4243651Z context = 2025-05-07T20:32:40.4243940Z 2025-05-07T20:32:40.4244106Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.4244639Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.4245114Z module_map=module_map) 2025-05-07T20:32:40.4245482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.4245837Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.4246101Z E ^ 2025-05-07T20:32:40.4246569Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.4247018Z 2025-05-07T20:32:40.4247439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.4247948Z 2025-05-07T20:32:40.4248053Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.4248471Z self=, 2025-05-07T20:32:40.4248882Z T=1, 2025-05-07T20:32:40.4249063Z D=7168, 2025-05-07T20:32:40.4249271Z scale_ub=1200.0, 2025-05-07T20:32:40.4249499Z contiguous=True, 2025-05-07T20:32:40.4249719Z compiled=True, 2025-05-07T20:32:40.4249931Z ) 2025-05-07T20:32:40.4250258Z self = 2025-05-07T20:32:40.4250741Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:40.4251009Z 2025-05-07T20:32:40.4251089Z @given( 2025-05-07T20:32:40.4251332Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.4251651Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.4251957Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.4252302Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.4252645Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.4252929Z ) 2025-05-07T20:32:40.4253281Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.4253730Z def test_silu_mul_quant( 2025-05-07T20:32:40.4253978Z self, 2025-05-07T20:32:40.4254184Z T: int, 2025-05-07T20:32:40.4254391Z D: int, 2025-05-07T20:32:40.4254608Z scale_ub: Optional[float], 2025-05-07T20:32:40.4254889Z contiguous: bool, 2025-05-07T20:32:40.4255142Z compiled: bool, 2025-05-07T20:32:40.4255371Z ) -> None: 2025-05-07T20:32:40.4255589Z torch.manual_seed(2025) 2025-05-07T20:32:40.4255836Z 2025-05-07T20:32:40.4256118Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.4256462Z 2025-05-07T20:32:40.4256657Z x_sign = torch.sign(x) 2025-05-07T20:32:40.4256953Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.4257261Z x = x_sign * x_clamp 2025-05-07T20:32:40.4257597Z x0 = x[:, :D] 2025-05-07T20:32:40.4257823Z x1 = x[:, D:] 2025-05-07T20:32:40.4258029Z 2025-05-07T20:32:40.4258219Z if contiguous: 2025-05-07T20:32:40.4258455Z x0 = x0.contiguous() 2025-05-07T20:32:40.4258787Z x1 = x1.contiguous() 2025-05-07T20:32:40.4259033Z 2025-05-07T20:32:40.4259230Z if scale_ub is not None: 2025-05-07T20:32:40.4259498Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.4259840Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.4260156Z ) 2025-05-07T20:32:40.4260347Z else: 2025-05-07T20:32:40.4260568Z scale_ub_tensor = None 2025-05-07T20:32:40.4260821Z 2025-05-07T20:32:40.4269250Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.4269687Z op = silu_mul_quant 2025-05-07T20:32:40.4269950Z if compiled: 2025-05-07T20:32:40.4270196Z op = torch.compile(op) 2025-05-07T20:32:40.4270502Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.4270787Z 2025-05-07T20:32:40.4270983Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.4271158Z 2025-05-07T20:32:40.4271262Z moe/activation_test.py:117: 2025-05-07T20:32:40.4271578Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.4271918Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.4272200Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.4272973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.4273540Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.4274198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.4274891Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.4275444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.4276224Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.4276889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.4277491Z kernel = self.compile( 2025-05-07T20:32:40.4278040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.4278710Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.4279113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.4279356Z 2025-05-07T20:32:40.4279567Z self = 2025-05-07T20:32:40.4280661Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.4282163Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be8662ca0>} 2025-05-07T20:32:40.4283587Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.4284615Z context = 2025-05-07T20:32:40.4284910Z 2025-05-07T20:32:40.4285078Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.4285603Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.4286077Z module_map=module_map) 2025-05-07T20:32:40.4286649Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.4287015Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.4287274Z E ^ 2025-05-07T20:32:40.4287741Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:40.4288740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:40.4289362Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:40.4289770Z self=,
2025-05-07T20:32:40.4290178Z T=1,
2025-05-07T20:32:40.4290374Z D=7168,
2025-05-07T20:32:40.4290572Z scale_ub=1200.0,
2025-05-07T20:32:40.4290800Z contiguous=False,
2025-05-07T20:32:40.4291035Z compiled=True,
2025-05-07T20:32:40.4291238Z )
2025-05-07T20:32:40.5708472Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:40.5708831Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:40.5709100Z E ^
2025-05-07T20:32:40.5709577Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:40.5710447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
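Every failure above is the same compile-time rejection: Triton refuses to build IR for the fp8e4nv (FP8 E4M3) type because this GPU only offers fp8e4b15 and fp8e5, the signature of a pre-sm_89 NVIDIA part. Native FP8 E4M3 lowering requires compute capability 8.9 or newer (Ada/Hopper), so the kernel dies in make_ir before a single example can run. A minimal sketch of a capability gate that would skip these tests on such hardware follows; supports_fp8e4nv and ActivationTests are illustrative names, not FBGEMM or Triton APIs.

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv lowers to native FP8 instructions, introduced at
        # compute capability 8.9 (sm_89). Ampere-class GPUs report (8, 6)
        # and raise the ValueError seen throughout this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 requires an sm_89+ GPU")
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant would live here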
2025-05-07T20:32:40.5711078Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:40.5711499Z self=,
2025-05-07T20:32:40.5711919Z T=1,
2025-05-07T20:32:40.5712112Z D=7168,
2025-05-07T20:32:40.5712318Z scale_ub=None,
2025-05-07T20:32:40.5712547Z contiguous=False,
2025-05-07T20:32:40.5712776Z compiled=True,
2025-05-07T20:32:40.5712999Z )
2025-05-07T20:32:40.6610309Z self = 
2025-05-07T20:32:40.6610857Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:40.6623027Z y_fp8, y_scale = fn()
2025-05-07T20:32:40.6623321Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:40.6623851Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:40.6624189Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:40.6624484Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:40.6624792Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:40.6625153Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:40.6625669Z > y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:32:40.6625867Z 2025-05-07T20:32:40.6625969Z moe/activation_test.py:126: 2025-05-07T20:32:40.6626273Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6626609Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:40.6626936Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.6627717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:40.6628469Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:40.6629009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6629688Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6630374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:40.6631090Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.6631909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:40.6632546Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:40.6633149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:40.6633765Z fn() 2025-05-07T20:32:40.6634273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:40.6634995Z self.fn.run( 2025-05-07T20:32:40.6635470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6636163Z kernel = self.compile( 2025-05-07T20:32:40.6636704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6637353Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6637752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6637991Z 2025-05-07T20:32:40.6638198Z self = 2025-05-07T20:32:40.6639278Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6640666Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be858e8e0>} 2025-05-07T20:32:40.6642002Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6643019Z context = 2025-05-07T20:32:40.6643314Z 2025-05-07T20:32:40.6643485Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6644007Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6644483Z module_map=module_map) 2025-05-07T20:32:40.6644839Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6645197Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:40.6645469Z E ^ 2025-05-07T20:32:40.6645928Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:40.6646800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:40.6647423Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:40.6647845Z self=,
2025-05-07T20:32:40.6648248Z T=1,
2025-05-07T20:32:40.6648441Z D=5120,
2025-05-07T20:32:40.6648646Z scale_ub=1200.0,
2025-05-07T20:32:40.6648867Z contiguous=False,
2025-05-07T20:32:40.6649102Z compiled=True,
2025-05-07T20:32:40.6649310Z )
2025-05-07T20:32:40.8222476Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:40.8222843Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:40.8223115Z E ^
2025-05-07T20:32:40.8223585Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:40.8224469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:40.8225096Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:40.8225519Z self=,
2025-05-07T20:32:40.8225925Z T=1,
2025-05-07T20:32:40.8226119Z D=5120,
2025-05-07T20:32:40.8226325Z scale_ub=1200.0,
2025-05-07T20:32:40.8226549Z contiguous=False,
2025-05-07T20:32:40.8226780Z compiled=False,
2025-05-07T20:32:40.8226992Z )
2025-05-07T20:32:40.8261721Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:40.8262089Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:40.8262349Z E ^
2025-05-07T20:32:40.8262818Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:40.8263702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
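Note that the T=1, D=7168, scale_ub=None example above got past fn() and instead died compiling _kernel_quantize_fp8_row inside the reference path, so the test's own oracle, triton_quantize_fp8_row, depends on fp8e4nv as well. For triage, the row-wise quantization can be approximated in eager PyTorch, since casting to torch.float8_e4m3fn is a software conversion that works without native FP8 hardware. This is a hedged sketch, assuming scale_ub caps the per-row max and that dequantization is y_fp8.float() * scale[:, None] as in the test; it is not FBGEMM's exact kernel logic.

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max, clamped away from zero so the divide is safe.
        row_max = y.abs().amax(dim=-1, keepdim=True).float().clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX
        # Clamp before the cast: values can exceed FP8_MAX when scale_ub
        # lowered the scale below the true row max.
        y_scaled = (y.float() / scale).clamp(-FP8_MAX, FP8_MAX)
        return y_scaled.to(torch.float8_e4m3fn), scale.squeeze(-1)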
2025-05-07T20:32:40.8264331Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:40.8264743Z self=,
2025-05-07T20:32:40.8265159Z T=16384,
2025-05-07T20:32:40.8265627Z D=5120,
2025-05-07T20:32:40.8265842Z scale_ub=1200.0,
2025-05-07T20:32:40.8266294Z contiguous=False,
2025-05-07T20:32:40.8266543Z compiled=True,
2025-05-07T20:32:40.8266751Z )
2025-05-07T20:32:40.9171698Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:40.9172058Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:40.9172316Z E ^
2025-05-07T20:32:40.9172791Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:40.9173663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:40.9174293Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:40.9174702Z self=,
2025-05-07T20:32:40.9175109Z T=2048,
2025-05-07T20:32:40.9175306Z D=7168,
2025-05-07T20:32:40.9175505Z scale_ub=1200.0,
2025-05-07T20:32:40.9175728Z contiguous=False,
2025-05-07T20:32:40.9175960Z compiled=True,
2025-05-07T20:32:40.9176168Z )
2025-05-07T20:32:40.9204147Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:40.9204510Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:40.9204780Z E ^
2025-05-07T20:32:40.9205240Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:40.9206114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
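The remaining examples repeat the same frames (moe/activation_test.py:115 into activation.py:80 into the Triton compile), so the root cause can be reproduced without FBGEMM at all. Below is a distilled sketch, assuming a recent Triton where tl.float8e4nv and torch.float8_e4m3fn interoperate; on a pre-sm_89 GPU the compile of this kernel should raise the same ValueError before anything executes. The kernel and its name are ours, not FBGEMM's _fbgemm_silu_mul_quant.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(X, Y, N, BLOCK: tl.constexpr):
        # Load a block of fp32 values and store them back as fp8e4nv; the
        # cast forces Triton to materialize the unsupported type in codegen.
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(X + offs, mask=mask)
        tl.store(Y + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    # Expected on sm_86: triton.compiler.errors.CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    _cast_to_fp8e4nv[(1024 // 256,)](x, y, 1024, BLOCK=256)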
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.9205699Z 2025-05-07T20:32:40.9206114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.9206634Z 2025-05-07T20:32:41.0371303Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.0371947Z self=, 2025-05-07T20:32:41.0372516Z T=1, 2025-05-07T20:32:41.0372780Z D=5120, 2025-05-07T20:32:41.0372977Z scale_ub=None, 2025-05-07T20:32:41.0373209Z contiguous=False, 2025-05-07T20:32:41.0373448Z compiled=False, 2025-05-07T20:32:41.0373659Z ) 2025-05-07T20:32:41.0374005Z self = 2025-05-07T20:32:41.0374513Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:41.0374786Z 2025-05-07T20:32:41.0374867Z @given( 2025-05-07T20:32:41.0375111Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.0375436Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.0375745Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.0376085Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.0376424Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.0376720Z ) 2025-05-07T20:32:41.0377072Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.0377567Z def test_silu_mul_quant( 2025-05-07T20:32:41.0377834Z self, 2025-05-07T20:32:41.0378035Z T: int, 2025-05-07T20:32:41.0378245Z D: int, 2025-05-07T20:32:41.0378476Z scale_ub: Optional[float], 2025-05-07T20:32:41.0378748Z contiguous: bool, 2025-05-07T20:32:41.0378998Z compiled: bool, 2025-05-07T20:32:41.0379234Z ) -> None: 2025-05-07T20:32:41.0379455Z torch.manual_seed(2025) 2025-05-07T20:32:41.0379708Z 2025-05-07T20:32:41.0379989Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.0380335Z 2025-05-07T20:32:41.0380535Z x_sign = torch.sign(x) 2025-05-07T20:32:41.0380833Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.0381142Z x = x_sign * x_clamp 2025-05-07T20:32:41.0381391Z x0 = x[:, :D] 2025-05-07T20:32:41.0381617Z x1 = x[:, D:] 2025-05-07T20:32:41.0381827Z 2025-05-07T20:32:41.0382018Z if contiguous: 2025-05-07T20:32:41.0382265Z x0 = x0.contiguous() 2025-05-07T20:32:41.0382534Z x1 = x1.contiguous() 2025-05-07T20:32:41.0382779Z 2025-05-07T20:32:41.0382981Z if scale_ub is not None: 2025-05-07T20:32:41.0383273Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.0383611Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.0383931Z ) 2025-05-07T20:32:41.0384134Z else: 2025-05-07T20:32:41.0384351Z scale_ub_tensor = None 2025-05-07T20:32:41.0384619Z 2025-05-07T20:32:41.0384861Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.0385178Z op = silu_mul_quant 2025-05-07T20:32:41.0385446Z if compiled: 2025-05-07T20:32:41.0385707Z op = torch.compile(op) 2025-05-07T20:32:41.0386010Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0386304Z 2025-05-07T20:32:41.0386510Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.0387049Z 2025-05-07T20:32:41.0387168Z moe/activation_test.py:117: 2025-05-07T20:32:41.0387496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0388020Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.0388315Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0389015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.0389717Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.0390270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.0390966Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.0391635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.0392179Z kernel = self.compile( 2025-05-07T20:32:41.0392738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.0393395Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.0393808Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0394046Z 2025-05-07T20:32:41.0394255Z self = 2025-05-07T20:32:41.0395348Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.0396829Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be976bd80>} 2025-05-07T20:32:41.0398233Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.0399266Z context = 2025-05-07T20:32:41.0399562Z 2025-05-07T20:32:41.0399737Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.0400272Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.0400746Z module_map=module_map) 2025-05-07T20:32:41.0401118Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.0401481Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.0401746Z E ^ 2025-05-07T20:32:41.0402216Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.0402668Z 2025-05-07T20:32:41.0403093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.0403604Z 2025-05-07T20:32:41.0403717Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.0404133Z self=, 2025-05-07T20:32:41.0404548Z T=4096, 2025-05-07T20:32:41.0404747Z D=7168, 2025-05-07T20:32:41.0404944Z scale_ub=1200.0, 2025-05-07T20:32:41.0405181Z contiguous=False, 2025-05-07T20:32:41.0405419Z compiled=False, 2025-05-07T20:32:41.0405628Z ) 2025-05-07T20:32:41.0405960Z self = 2025-05-07T20:32:41.0406470Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:41.0406749Z 2025-05-07T20:32:41.0406839Z @given( 2025-05-07T20:32:41.0407075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.0407402Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.0407813Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.0408150Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.0408493Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.0408873Z ) 2025-05-07T20:32:41.0409221Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.0409679Z def test_silu_mul_quant( 2025-05-07T20:32:41.0409935Z self, 2025-05-07T20:32:41.0410138Z T: int, 2025-05-07T20:32:41.0410541Z D: int, 2025-05-07T20:32:41.0410770Z scale_ub: Optional[float], 2025-05-07T20:32:41.0411042Z contiguous: bool, 2025-05-07T20:32:41.0411297Z compiled: bool, 2025-05-07T20:32:41.0411537Z ) -> None: 2025-05-07T20:32:41.0411757Z torch.manual_seed(2025) 2025-05-07T20:32:41.0412005Z 2025-05-07T20:32:41.0412286Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.0412638Z 2025-05-07T20:32:41.0412842Z x_sign = torch.sign(x) 2025-05-07T20:32:41.0413140Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.0413454Z x = x_sign * x_clamp 2025-05-07T20:32:41.0413697Z x0 = x[:, :D] 2025-05-07T20:32:41.0413939Z x1 = x[:, D:] 2025-05-07T20:32:41.0414152Z 2025-05-07T20:32:41.0414338Z if contiguous: 2025-05-07T20:32:41.0414582Z x0 = x0.contiguous() 2025-05-07T20:32:41.0414847Z x1 = x1.contiguous() 2025-05-07T20:32:41.0415088Z 2025-05-07T20:32:41.0415290Z if scale_ub is not None: 2025-05-07T20:32:41.0415570Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.0415907Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.0416224Z ) 2025-05-07T20:32:41.0416426Z else: 2025-05-07T20:32:41.0416641Z scale_ub_tensor = None 2025-05-07T20:32:41.0416902Z 2025-05-07T20:32:41.0417142Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.0417473Z op = silu_mul_quant 2025-05-07T20:32:41.0417732Z if compiled: 2025-05-07T20:32:41.0417988Z op = torch.compile(op) 2025-05-07T20:32:41.0418291Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0418578Z 2025-05-07T20:32:41.0418784Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.0418949Z 2025-05-07T20:32:41.0419058Z moe/activation_test.py:117: 2025-05-07T20:32:41.0419354Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0419702Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.0419989Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0420677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:41.0421375Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.0421924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.0422614Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.0423279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.0423821Z kernel = self.compile( 2025-05-07T20:32:41.0424367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.0425022Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.0425427Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0425667Z 2025-05-07T20:32:41.0425880Z self = 2025-05-07T20:32:41.0427064Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.0428448Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be93a0e00>} 2025-05-07T20:32:41.0429901Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.0430933Z context = 2025-05-07T20:32:41.0431231Z 2025-05-07T20:32:41.0431405Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.0431936Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.0432410Z module_map=module_map) 2025-05-07T20:32:41.0432782Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.0433155Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.0433419Z E ^ 2025-05-07T20:32:41.0433893Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.0434368Z 2025-05-07T20:32:41.0434788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.0435299Z 2025-05-07T20:32:41.0435415Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.0435879Z self=, 2025-05-07T20:32:41.0436291Z T=16384, 2025-05-07T20:32:41.0436500Z D=7168, 2025-05-07T20:32:41.0436698Z scale_ub=None, 2025-05-07T20:32:41.0436921Z contiguous=True, 2025-05-07T20:32:41.0437158Z compiled=True, 2025-05-07T20:32:41.0437371Z ) 2025-05-07T20:32:41.2242465Z self = 2025-05-07T20:32:41.2243268Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:41.2243650Z 2025-05-07T20:32:41.2243748Z @given( 2025-05-07T20:32:41.2243987Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.2244321Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.2244637Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.2244967Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.2245302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.2245596Z ) 2025-05-07T20:32:41.2245949Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.2246404Z def test_silu_mul_quant( 2025-05-07T20:32:41.2246656Z self, 2025-05-07T20:32:41.2246856Z T: int, 2025-05-07T20:32:41.2247064Z D: int, 2025-05-07T20:32:41.2247295Z scale_ub: Optional[float], 2025-05-07T20:32:41.2247567Z contiguous: bool, 2025-05-07T20:32:41.2247828Z compiled: bool, 2025-05-07T20:32:41.2248075Z ) -> None: 2025-05-07T20:32:41.2248297Z torch.manual_seed(2025) 2025-05-07T20:32:41.2248538Z 2025-05-07T20:32:41.2248821Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.2249178Z 2025-05-07T20:32:41.2257333Z x_sign = torch.sign(x) 2025-05-07T20:32:41.2257675Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.2257992Z x = x_sign * x_clamp 2025-05-07T20:32:41.2258248Z x0 = x[:, :D] 2025-05-07T20:32:41.2258477Z x1 = x[:, D:] 2025-05-07T20:32:41.2258686Z 2025-05-07T20:32:41.2258884Z if contiguous: 2025-05-07T20:32:41.2259132Z x0 = x0.contiguous() 2025-05-07T20:32:41.2259406Z x1 = x1.contiguous() 2025-05-07T20:32:41.2259649Z 2025-05-07T20:32:41.2259843Z if scale_ub is not None: 2025-05-07T20:32:41.2260131Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.2260799Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.2261109Z ) 2025-05-07T20:32:41.2261298Z else: 2025-05-07T20:32:41.2261520Z scale_ub_tensor = None 2025-05-07T20:32:41.2261930Z 2025-05-07T20:32:41.2262176Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.2262502Z op = silu_mul_quant 2025-05-07T20:32:41.2262752Z if compiled: 2025-05-07T20:32:41.2263011Z op = torch.compile(op) 2025-05-07T20:32:41.2263317Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2263594Z 2025-05-07T20:32:41.2263797Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.2263963Z 2025-05-07T20:32:41.2264078Z moe/activation_test.py:117: 2025-05-07T20:32:41.2264384Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2264716Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.2265007Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2265874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.2266436Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea4d8fe0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self =
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea4afb00>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

[Ten further Hypothesis examples elided: each ran the identical test source and failed with the identical CompilationError traceback shown above. The parameter combinations tried were:]

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
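Every example fails at the same point: Triton rejects the _fbgemm_silu_mul_quant kernel because the fp8e4nv dtype (Triton's name for torch.float8_e4m3fn) can only be compiled for NVIDIA GPUs with compute capability 8.9 or newer (e.g. L4, L40S, H100), while this device only offers 'fp8e4b15' and 'fp8e5', which is the pre-SM-8.9 behavior of an Ampere-class part such as the A10G (SM 8.6). A minimal sketch of a capability guard that would skip these examples on such hardware follows; the supports_fp8e4nv helper and the test-class decoration are illustrative assumptions, not FBGEMM's actual code.

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv kernels need an NVIDIA GPU with compute capability >= (8, 9);
    # an A10G reports (8, 6) and raises the CompilationError seen above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "FP8 (fp8e4nv) kernels need SM 8.9+")
class SiluMulQuantTest(unittest.TestCase):  # hypothetical class name
    ...

Guarding once at collection time would also stop Hypothesis from re-deriving the same CompilationError for every sampled (T, D, scale_ub, contiguous, compiled) combination, as it does in the remainder of this log.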
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.0220216Z 2025-05-07T20:32:42.0220633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.0221154Z 2025-05-07T20:32:42.1934963Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1935597Z self=, 2025-05-07T20:32:42.1936018Z T=16384, 2025-05-07T20:32:42.1936214Z D=5120, 2025-05-07T20:32:42.1936425Z scale_ub=None, 2025-05-07T20:32:42.1936659Z contiguous=False, 2025-05-07T20:32:42.1936893Z compiled=True, 2025-05-07T20:32:42.1937100Z ) 2025-05-07T20:32:42.1937427Z self = 2025-05-07T20:32:42.1938119Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.1938396Z 2025-05-07T20:32:42.1938477Z @given( 2025-05-07T20:32:42.1938714Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1939033Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1939335Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1939669Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1940001Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1940288Z ) 2025-05-07T20:32:42.1940932Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1941381Z def test_silu_mul_quant( 2025-05-07T20:32:42.1941631Z self, 2025-05-07T20:32:42.1941966Z T: int, 2025-05-07T20:32:42.1942165Z D: int, 2025-05-07T20:32:42.1942383Z scale_ub: Optional[float], 2025-05-07T20:32:42.1942675Z contiguous: bool, 2025-05-07T20:32:42.1942912Z compiled: bool, 2025-05-07T20:32:42.1943140Z ) -> None: 2025-05-07T20:32:42.1943360Z torch.manual_seed(2025) 2025-05-07T20:32:42.1943597Z 2025-05-07T20:32:42.1943872Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1944221Z 2025-05-07T20:32:42.1944414Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1944708Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1945023Z x = x_sign * x_clamp 2025-05-07T20:32:42.1945261Z x0 = x[:, :D] 2025-05-07T20:32:42.1945492Z x1 = x[:, D:] 2025-05-07T20:32:42.1945704Z 2025-05-07T20:32:42.1945891Z if contiguous: 2025-05-07T20:32:42.1946128Z x0 = x0.contiguous() 2025-05-07T20:32:42.1946391Z x1 = x1.contiguous() 2025-05-07T20:32:42.1946632Z 2025-05-07T20:32:42.1946836Z if scale_ub is not None: 2025-05-07T20:32:42.1947113Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1947450Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1947761Z ) 2025-05-07T20:32:42.1947962Z else: 2025-05-07T20:32:42.1948217Z scale_ub_tensor = None 2025-05-07T20:32:42.1948479Z 2025-05-07T20:32:42.1948719Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1949039Z op = silu_mul_quant 2025-05-07T20:32:42.1949286Z if compiled: 2025-05-07T20:32:42.1949538Z op = torch.compile(op) 2025-05-07T20:32:42.1949844Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1950122Z 2025-05-07T20:32:42.1950319Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1950483Z 2025-05-07T20:32:42.1950589Z moe/activation_test.py:117: 2025-05-07T20:32:42.1950888Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1951227Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1951513Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1952077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.1952635Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.1953297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1953987Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1954525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1955217Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1955990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1956531Z kernel = self.compile( 2025-05-07T20:32:42.1957069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1957727Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1958163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1958413Z 2025-05-07T20:32:42.1958627Z self = 2025-05-07T20:32:42.1959795Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1961191Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be8ae1d00>} 2025-05-07T20:32:42.1962617Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1963646Z context = 2025-05-07T20:32:42.1963934Z 2025-05-07T20:32:42.1964108Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1964631Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1965112Z module_map=module_map) 2025-05-07T20:32:42.1965737Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1966109Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1966378Z E ^ 2025-05-07T20:32:42.1966843Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1967309Z 2025-05-07T20:32:42.1967727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1968237Z 2025-05-07T20:32:42.1968347Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1968760Z self=, 2025-05-07T20:32:42.1969165Z T=2048, 2025-05-07T20:32:42.1969365Z D=5120, 2025-05-07T20:32:42.1969570Z scale_ub=None, 2025-05-07T20:32:42.1969784Z contiguous=False, 2025-05-07T20:32:42.1970013Z compiled=True, 2025-05-07T20:32:42.1970221Z ) 2025-05-07T20:32:42.4919928Z self = 2025-05-07T20:32:42.4920768Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.4921054Z 2025-05-07T20:32:42.4921135Z @given( 2025-05-07T20:32:42.4921369Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.4921694Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.4921998Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.4922333Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.4922662Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.4922946Z ) 2025-05-07T20:32:42.4923300Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.4923744Z def test_silu_mul_quant( 2025-05-07T20:32:42.4923986Z self, 2025-05-07T20:32:42.4924183Z T: int, 2025-05-07T20:32:42.4924386Z D: int, 2025-05-07T20:32:42.4924609Z scale_ub: Optional[float], 2025-05-07T20:32:42.4924879Z contiguous: bool, 2025-05-07T20:32:42.4925126Z compiled: bool, 2025-05-07T20:32:42.4925364Z ) -> None: 2025-05-07T20:32:42.4925581Z torch.manual_seed(2025) 2025-05-07T20:32:42.4925834Z 2025-05-07T20:32:42.4926115Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.4926467Z 2025-05-07T20:32:42.4926673Z x_sign = torch.sign(x) 2025-05-07T20:32:42.4926972Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.4927281Z x = x_sign * x_clamp 2025-05-07T20:32:42.4927530Z x0 = x[:, :D] 2025-05-07T20:32:42.4927753Z x1 = x[:, D:] 2025-05-07T20:32:42.4927965Z 2025-05-07T20:32:42.4928168Z if contiguous: 2025-05-07T20:32:42.4928405Z x0 = x0.contiguous() 2025-05-07T20:32:42.4928665Z x1 = x1.contiguous() 2025-05-07T20:32:42.4928914Z 2025-05-07T20:32:42.4929125Z if scale_ub is not None: 2025-05-07T20:32:42.4929401Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.4930095Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.4930413Z ) 2025-05-07T20:32:42.4930622Z else: 2025-05-07T20:32:42.4930834Z scale_ub_tensor = None 2025-05-07T20:32:42.4931099Z 2025-05-07T20:32:42.4931491Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.4931803Z op = silu_mul_quant 2025-05-07T20:32:42.4932059Z if compiled: 2025-05-07T20:32:42.4932310Z op = torch.compile(op) 2025-05-07T20:32:42.4932609Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.4932888Z 2025-05-07T20:32:42.4933094Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.4933258Z 2025-05-07T20:32:42.4933363Z moe/activation_test.py:117: 2025-05-07T20:32:42.4933666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.4934008Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.4934294Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.4934857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.4935419Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.4936077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.4936767Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.4937311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.4938010Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.4938678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.4939211Z kernel = self.compile( 2025-05-07T20:32:42.4939753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.4940422Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.4940825Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.4941066Z 2025-05-07T20:32:42.4941282Z self = 2025-05-07T20:32:42.4942378Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.4943769Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be8ae14e0>} 2025-05-07T20:32:42.4953475Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.4954538Z context = 2025-05-07T20:32:42.4954835Z 2025-05-07T20:32:42.4955012Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.4955546Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.4956100Z module_map=module_map) 2025-05-07T20:32:42.4956469Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.4956835Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.4957101Z E ^ 2025-05-07T20:32:42.4957568Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.4958024Z 2025-05-07T20:32:42.4958445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.4958964Z 2025-05-07T20:32:42.4959203Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.4959624Z self=, 2025-05-07T20:32:42.4960039Z T=2048, 2025-05-07T20:32:42.4960228Z D=5120, 2025-05-07T20:32:42.4960512Z scale_ub=1200.0, 2025-05-07T20:32:42.4960743Z contiguous=False, 2025-05-07T20:32:42.4960965Z compiled=True, 2025-05-07T20:32:42.4961187Z ) 2025-05-07T20:32:42.4961518Z self = 2025-05-07T20:32:42.4962014Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.4962309Z 2025-05-07T20:32:42.4962392Z @given( 2025-05-07T20:32:42.4962632Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.4962943Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.4963260Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.4963602Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.4963948Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.4964231Z ) 2025-05-07T20:32:42.4964583Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.4965035Z def test_silu_mul_quant( 2025-05-07T20:32:42.4965289Z self, 2025-05-07T20:32:42.4965772Z T: int, 2025-05-07T20:32:42.4965979Z D: int, 2025-05-07T20:32:42.4966195Z scale_ub: Optional[float], 2025-05-07T20:32:42.4966480Z contiguous: bool, 2025-05-07T20:32:42.4966728Z compiled: bool, 2025-05-07T20:32:42.4966950Z ) -> None: 2025-05-07T20:32:42.4967173Z torch.manual_seed(2025) 2025-05-07T20:32:42.4967419Z 2025-05-07T20:32:42.4967689Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.4968034Z 2025-05-07T20:32:42.4968238Z x_sign = torch.sign(x) 2025-05-07T20:32:42.4968528Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.4968843Z x = x_sign * x_clamp 2025-05-07T20:32:42.4969096Z x0 = x[:, :D] 2025-05-07T20:32:42.4969316Z x1 = x[:, D:] 2025-05-07T20:32:42.4969522Z 2025-05-07T20:32:42.4969715Z if contiguous: 2025-05-07T20:32:42.4969951Z x0 = x0.contiguous() 2025-05-07T20:32:42.4970209Z x1 = x1.contiguous() 2025-05-07T20:32:42.4970443Z 2025-05-07T20:32:42.4970636Z if scale_ub is not None: 2025-05-07T20:32:42.4970908Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.4971241Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.4971562Z ) 2025-05-07T20:32:42.4971755Z else: 2025-05-07T20:32:42.4971975Z scale_ub_tensor = None 2025-05-07T20:32:42.4972231Z 2025-05-07T20:32:42.4972465Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.4972786Z op = silu_mul_quant 2025-05-07T20:32:42.4973042Z if compiled: 2025-05-07T20:32:42.4973292Z op = torch.compile(op) 2025-05-07T20:32:42.4973600Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.4973880Z 2025-05-07T20:32:42.4974081Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.4974248Z 2025-05-07T20:32:42.4974356Z moe/activation_test.py:117: 2025-05-07T20:32:42.4974661Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.4975009Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.4975289Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.4975855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.4976425Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.4977097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.4977778Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.4978478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.4979173Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.4979830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.4980475Z kernel = self.compile( 2025-05-07T20:32:42.4981021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.4981679Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.4982077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.4982313Z 2025-05-07T20:32:42.4982524Z self = 2025-05-07T20:32:42.4983617Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.4985001Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be97fdf80>} 2025-05-07T20:32:42.4986342Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.4987378Z context = 2025-05-07T20:32:42.4987678Z 2025-05-07T20:32:42.4987846Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.4988375Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.4988842Z module_map=module_map) 2025-05-07T20:32:42.4989219Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.4989583Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.4989850Z E ^ 2025-05-07T20:32:42.4990312Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.4990777Z 2025-05-07T20:32:42.4991191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.4991703Z 2025-05-07T20:32:42.6730918Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6731510Z self=, 2025-05-07T20:32:42.6731937Z T=4096, 2025-05-07T20:32:42.6732139Z D=5120, 2025-05-07T20:32:42.6732338Z scale_ub=1200.0, 2025-05-07T20:32:42.6732576Z contiguous=True, 2025-05-07T20:32:42.6732814Z compiled=True, 2025-05-07T20:32:42.6733028Z ) 2025-05-07T20:32:42.6733384Z self = 2025-05-07T20:32:42.6733893Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.6734172Z 2025-05-07T20:32:42.6734263Z @given( 2025-05-07T20:32:42.6734518Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6734849Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6735180Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6735524Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6735869Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6736172Z ) 2025-05-07T20:32:42.6736528Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6736987Z def test_silu_mul_quant( 2025-05-07T20:32:42.6737245Z self, 2025-05-07T20:32:42.6737459Z T: int, 2025-05-07T20:32:42.6737666Z D: int, 2025-05-07T20:32:42.6737897Z scale_ub: Optional[float], 2025-05-07T20:32:42.6738462Z contiguous: bool, 2025-05-07T20:32:42.6738722Z compiled: bool, 2025-05-07T20:32:42.6738962Z ) -> None: 2025-05-07T20:32:42.6739185Z torch.manual_seed(2025) 2025-05-07T20:32:42.6739449Z 2025-05-07T20:32:42.6739868Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6740220Z 2025-05-07T20:32:42.6740431Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6740743Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6741061Z x = x_sign * x_clamp 2025-05-07T20:32:42.6741320Z x0 = x[:, :D] 2025-05-07T20:32:42.6741549Z x1 = x[:, D:] 2025-05-07T20:32:42.6741759Z 2025-05-07T20:32:42.6741959Z if contiguous: 2025-05-07T20:32:42.6742204Z x0 = x0.contiguous() 2025-05-07T20:32:42.6742466Z x1 = x1.contiguous() 2025-05-07T20:32:42.6742717Z 2025-05-07T20:32:42.6742921Z if scale_ub is not None: 2025-05-07T20:32:42.6743201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6743550Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6743871Z ) 2025-05-07T20:32:42.6744076Z else: 2025-05-07T20:32:42.6744290Z scale_ub_tensor = None 2025-05-07T20:32:42.6744560Z 2025-05-07T20:32:42.6744805Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6745122Z op = silu_mul_quant 2025-05-07T20:32:42.6745384Z if compiled: 2025-05-07T20:32:42.6745639Z op = torch.compile(op) 2025-05-07T20:32:42.6745937Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6746222Z 2025-05-07T20:32:42.6746427Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6746593Z 2025-05-07T20:32:42.6746697Z moe/activation_test.py:117: 2025-05-07T20:32:42.6747004Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6747348Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6747639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6748201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6748774Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6749445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6750133Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6750676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6751369Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6752041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6752577Z kernel = self.compile( 2025-05-07T20:32:42.6753134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6753798Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6754201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6754444Z 2025-05-07T20:32:42.6754654Z self = 2025-05-07T20:32:42.6755843Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6757247Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be97fe200>} 2025-05-07T20:32:42.6758694Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6759725Z context = 2025-05-07T20:32:42.6760023Z 2025-05-07T20:32:42.6760328Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6760861Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6761348Z module_map=module_map) 2025-05-07T20:32:42.6761712Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6762080Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6762356Z E ^ 2025-05-07T20:32:42.6762825Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6763287Z 2025-05-07T20:32:42.6763704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6764233Z 2025-05-07T20:32:42.6764340Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6764761Z self=, 2025-05-07T20:32:42.6765177Z T=128, 2025-05-07T20:32:42.6765665Z D=5120, 2025-05-07T20:32:42.6765873Z scale_ub=1200.0, 2025-05-07T20:32:42.6766098Z contiguous=False, 2025-05-07T20:32:42.6766328Z compiled=True, 2025-05-07T20:32:42.6766539Z ) 2025-05-07T20:32:42.7782726Z self = 2025-05-07T20:32:42.7783530Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.7783874Z 2025-05-07T20:32:42.7783959Z @given( 2025-05-07T20:32:42.7784196Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7784524Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7784838Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7785201Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7785535Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7785826Z ) 2025-05-07T20:32:42.7786176Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7786639Z def test_silu_mul_quant( 2025-05-07T20:32:42.7786886Z self, 2025-05-07T20:32:42.7787084Z T: int, 2025-05-07T20:32:42.7787293Z D: int, 2025-05-07T20:32:42.7787522Z scale_ub: Optional[float], 2025-05-07T20:32:42.7787795Z contiguous: bool, 2025-05-07T20:32:42.7788052Z compiled: bool, 2025-05-07T20:32:42.7788287Z ) -> None: 2025-05-07T20:32:42.7788506Z torch.manual_seed(2025) 2025-05-07T20:32:42.7788760Z 2025-05-07T20:32:42.7789046Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7789395Z 2025-05-07T20:32:42.7789606Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7789917Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7790240Z x = x_sign * x_clamp 2025-05-07T20:32:42.7790486Z x0 = x[:, :D] 2025-05-07T20:32:42.7790712Z x1 = x[:, D:] 2025-05-07T20:32:42.7790931Z 2025-05-07T20:32:42.7791128Z if contiguous: 2025-05-07T20:32:42.7791375Z x0 = x0.contiguous() 2025-05-07T20:32:42.7791649Z x1 = x1.contiguous() 2025-05-07T20:32:42.7791894Z 2025-05-07T20:32:42.7792095Z if scale_ub is not None: 2025-05-07T20:32:42.7792376Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.7792712Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.7793041Z ) 2025-05-07T20:32:42.7793249Z else: 2025-05-07T20:32:42.7793463Z scale_ub_tensor = None 2025-05-07T20:32:42.7793732Z 2025-05-07T20:32:42.7793979Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.7794299Z op = silu_mul_quant 2025-05-07T20:32:42.7794560Z if compiled: 2025-05-07T20:32:42.7795133Z op = torch.compile(op) 2025-05-07T20:32:42.7795442Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7795818Z 2025-05-07T20:32:42.7796022Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.7796331Z 2025-05-07T20:32:42.7796441Z moe/activation_test.py:117: 2025-05-07T20:32:42.7796742Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7797088Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.7797382Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7797941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.7798560Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.7799226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.7799920Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.7800463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.7801160Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.7801841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.7802373Z kernel = self.compile( 2025-05-07T20:32:42.7802922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.7803587Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.7803998Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7804233Z 2025-05-07T20:32:42.7804444Z self = 2025-05-07T20:32:42.7805538Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.7806939Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea294c20>} 2025-05-07T20:32:42.7808293Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.7809331Z context = 2025-05-07T20:32:42.7809621Z 2025-05-07T20:32:42.7809791Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.7810322Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.7810810Z module_map=module_map) 2025-05-07T20:32:42.7811177Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.7811550Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.7811821Z E ^ 2025-05-07T20:32:42.7812301Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7812754Z 2025-05-07T20:32:42.7813172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7813692Z 2025-05-07T20:32:42.7813800Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7814224Z self=, 2025-05-07T20:32:42.7814640Z T=16384, 2025-05-07T20:32:42.7814841Z D=7168, 2025-05-07T20:32:42.7815048Z scale_ub=1200.0, 2025-05-07T20:32:42.7815284Z contiguous=True, 2025-05-07T20:32:42.7815510Z compiled=True, 2025-05-07T20:32:42.7815730Z ) 2025-05-07T20:32:42.7816200Z self = 2025-05-07T20:32:42.7816700Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.7817061Z 2025-05-07T20:32:42.7817143Z @given( 2025-05-07T20:32:42.7817388Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7817711Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7818034Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7818375Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7818714Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7819007Z ) 2025-05-07T20:32:42.7819368Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7819819Z def test_silu_mul_quant( 2025-05-07T20:32:42.7820066Z self, 2025-05-07T20:32:42.7820269Z T: int, 2025-05-07T20:32:42.7820478Z D: int, 2025-05-07T20:32:42.7820708Z scale_ub: Optional[float], 2025-05-07T20:32:42.7820989Z contiguous: bool, 2025-05-07T20:32:42.7821236Z compiled: bool, 2025-05-07T20:32:42.7821460Z ) -> None: 2025-05-07T20:32:42.7821693Z torch.manual_seed(2025) 2025-05-07T20:32:42.7821948Z 2025-05-07T20:32:42.7822226Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7822575Z 2025-05-07T20:32:42.7822775Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7823076Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7823403Z x = x_sign * x_clamp 2025-05-07T20:32:42.7823658Z x0 = x[:, :D] 2025-05-07T20:32:42.7823881Z x1 = x[:, D:] 2025-05-07T20:32:42.7824102Z 2025-05-07T20:32:42.7824299Z if contiguous: 2025-05-07T20:32:42.7824537Z x0 = x0.contiguous() 2025-05-07T20:32:42.7824806Z x1 = x1.contiguous() 2025-05-07T20:32:42.7825060Z 2025-05-07T20:32:42.7825257Z if scale_ub is not None: 2025-05-07T20:32:42.7825545Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.7825892Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.7826210Z ) 2025-05-07T20:32:42.7826420Z else: 2025-05-07T20:32:42.7826645Z scale_ub_tensor = None 2025-05-07T20:32:42.7826908Z 2025-05-07T20:32:42.7827149Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.7827478Z op = silu_mul_quant 2025-05-07T20:32:42.7827739Z if compiled: 2025-05-07T20:32:42.7827992Z op = torch.compile(op) 2025-05-07T20:32:42.7828305Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7828594Z 2025-05-07T20:32:42.7828792Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.7828964Z 2025-05-07T20:32:42.7829068Z moe/activation_test.py:117: 2025-05-07T20:32:42.7829370Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7829719Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.7830009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7830580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.7831163Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.7831827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.7832536Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.7833085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.7833776Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.7834450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.7835001Z kernel = self.compile( 2025-05-07T20:32:42.7835645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.7836391Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.7836879Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7837121Z 2025-05-07T20:32:42.7837335Z self = 2025-05-07T20:32:42.7838425Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.7839797Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea296a20>} 2025-05-07T20:32:42.7841159Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.7842199Z context = 2025-05-07T20:32:42.7842495Z 2025-05-07T20:32:42.7842675Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.7843215Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.7843690Z module_map=module_map) 2025-05-07T20:32:42.7844071Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.7844442Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.7844708Z E ^ 2025-05-07T20:32:42.7845191Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7845646Z 2025-05-07T20:32:42.7846079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7846596Z 2025-05-07T20:32:42.9080801Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.9081447Z self=, 2025-05-07T20:32:42.9082065Z T=16384, 2025-05-07T20:32:42.9082275Z D=5120, 2025-05-07T20:32:42.9082475Z scale_ub=1200.0, 2025-05-07T20:32:42.9082697Z contiguous=True, 2025-05-07T20:32:42.9082926Z compiled=False, 2025-05-07T20:32:42.9083135Z ) 2025-05-07T20:32:42.9083455Z self = 2025-05-07T20:32:42.9083966Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.9084250Z 2025-05-07T20:32:42.9084345Z @given( 2025-05-07T20:32:42.9084606Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.9084929Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.9085250Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.9085587Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.9085922Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.9086211Z ) 2025-05-07T20:32:42.9086566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.9087036Z def test_silu_mul_quant( 2025-05-07T20:32:42.9087277Z self, 2025-05-07T20:32:42.9087486Z T: int, 2025-05-07T20:32:42.9087694Z D: int, 2025-05-07T20:32:42.9087919Z scale_ub: Optional[float], 2025-05-07T20:32:42.9088206Z contiguous: bool, 2025-05-07T20:32:42.9088452Z compiled: bool, 2025-05-07T20:32:42.9088683Z ) -> None: 2025-05-07T20:32:42.9097039Z torch.manual_seed(2025) 2025-05-07T20:32:42.9097312Z 2025-05-07T20:32:42.9097594Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.9097955Z 2025-05-07T20:32:42.9098162Z x_sign = torch.sign(x) 2025-05-07T20:32:42.9098794Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.9099126Z x = x_sign * x_clamp 2025-05-07T20:32:42.9099380Z x0 = x[:, :D] 2025-05-07T20:32:42.9099767Z x1 = x[:, D:] 2025-05-07T20:32:42.9099994Z 2025-05-07T20:32:42.9100182Z if contiguous: 2025-05-07T20:32:42.9100436Z x0 = x0.contiguous() 2025-05-07T20:32:42.9100704Z x1 = x1.contiguous() 2025-05-07T20:32:42.9100945Z 2025-05-07T20:32:42.9101147Z if scale_ub is not None: 2025-05-07T20:32:42.9101436Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.9101785Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.9102103Z ) 2025-05-07T20:32:42.9102308Z else: 2025-05-07T20:32:42.9102531Z scale_ub_tensor = None 2025-05-07T20:32:42.9102791Z 2025-05-07T20:32:42.9103042Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.9103374Z op = silu_mul_quant 2025-05-07T20:32:42.9103627Z if compiled: 2025-05-07T20:32:42.9103886Z op = torch.compile(op) 2025-05-07T20:32:42.9104192Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.9104476Z 2025-05-07T20:32:42.9104682Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.9104849Z 2025-05-07T20:32:42.9104964Z moe/activation_test.py:117: 2025-05-07T20:32:42.9105262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.9105609Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.9105902Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.9106606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.9107293Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.9107834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.9108528Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.9109198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.9109735Z kernel = self.compile( 2025-05-07T20:32:42.9110281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.9110944Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.9111343Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.9111586Z 2025-05-07T20:32:42.9111796Z self = 2025-05-07T20:32:42.9112893Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.9114286Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea19f9c0>} 2025-05-07T20:32:42.9115645Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.9116771Z context = 2025-05-07T20:32:42.9117069Z 2025-05-07T20:32:42.9117239Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.9117765Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.9118244Z module_map=module_map) 2025-05-07T20:32:42.9118607Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.9119059Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.9119328Z E ^ 2025-05-07T20:32:42.9119795Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.9120328Z 2025-05-07T20:32:42.9120746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.9121267Z 2025-05-07T20:32:42.9121375Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.9121804Z self=, 2025-05-07T20:32:42.9122220Z T=1, 2025-05-07T20:32:42.9122421Z D=7168, 2025-05-07T20:32:42.9122626Z scale_ub=1200.0, 2025-05-07T20:32:42.9122861Z contiguous=False, 2025-05-07T20:32:42.9123099Z compiled=False, 2025-05-07T20:32:42.9123319Z ) 2025-05-07T20:32:42.9123647Z self = 2025-05-07T20:32:42.9124163Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.9124441Z 2025-05-07T20:32:42.9124536Z @given( 2025-05-07T20:32:42.9124769Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.9125103Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.9125425Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.9125769Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.9126105Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.9126414Z ) 2025-05-07T20:32:42.9126775Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.9127230Z def test_silu_mul_quant( 2025-05-07T20:32:42.9127485Z self, 2025-05-07T20:32:42.9127690Z T: int, 2025-05-07T20:32:42.9127891Z D: int, 2025-05-07T20:32:42.9128128Z scale_ub: Optional[float], 2025-05-07T20:32:42.9128416Z contiguous: bool, 2025-05-07T20:32:42.9128672Z compiled: bool, 2025-05-07T20:32:42.9128907Z ) -> None: 2025-05-07T20:32:42.9129143Z torch.manual_seed(2025) 2025-05-07T20:32:42.9129391Z 2025-05-07T20:32:42.9129673Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.9130042Z 2025-05-07T20:32:42.9130250Z x_sign = torch.sign(x) 2025-05-07T20:32:42.9130549Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.9130869Z x = x_sign * x_clamp 2025-05-07T20:32:42.9131125Z x0 = x[:, :D] 2025-05-07T20:32:42.9131352Z x1 = x[:, D:] 2025-05-07T20:32:42.9131570Z 2025-05-07T20:32:42.9131767Z if contiguous: 2025-05-07T20:32:42.9132002Z x0 = x0.contiguous() 2025-05-07T20:32:42.9132270Z x1 = x1.contiguous() 2025-05-07T20:32:42.9132513Z 2025-05-07T20:32:42.9132703Z if scale_ub is not None: 2025-05-07T20:32:42.9132989Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.9133340Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.9133653Z ) 2025-05-07T20:32:42.9133853Z else: 2025-05-07T20:32:42.9134074Z scale_ub_tensor = None 2025-05-07T20:32:42.9134331Z 2025-05-07T20:32:42.9134572Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.9134893Z op = silu_mul_quant 2025-05-07T20:32:42.9135143Z if compiled: 2025-05-07T20:32:42.9135392Z op = torch.compile(op) 2025-05-07T20:32:42.9135704Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.9135992Z 2025-05-07T20:32:42.9136191Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.9136364Z 2025-05-07T20:32:42.9136471Z moe/activation_test.py:117: 2025-05-07T20:32:42.9136780Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.9137124Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.9137422Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.9138204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.9138954Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.9139571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.9140265Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.9140946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.9141481Z kernel = self.compile( 2025-05-07T20:32:42.9142036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.9142701Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.9143114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.9143353Z 2025-05-07T20:32:42.9143568Z self = 2025-05-07T20:32:42.9144655Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.9146044Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea6c5800>} 2025-05-07T20:32:42.9147396Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.9148425Z context = 2025-05-07T20:32:42.9148774Z 2025-05-07T20:32:42.9148949Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.9149491Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.9149963Z module_map=module_map) 2025-05-07T20:32:42.9150346Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.9150710Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.9150972Z E ^ 2025-05-07T20:32:42.9151441Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.9151907Z 2025-05-07T20:32:42.9152327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.9152837Z 2025-05-07T20:32:43.0903504Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.0904125Z self=, 2025-05-07T20:32:43.0904725Z T=4096, 2025-05-07T20:32:43.0905009Z D=7168, 2025-05-07T20:32:43.0905285Z scale_ub=1200.0, 2025-05-07T20:32:43.0905521Z contiguous=False, 2025-05-07T20:32:43.0905754Z compiled=True, 2025-05-07T20:32:43.0905975Z ) 2025-05-07T20:32:43.0906313Z self = 2025-05-07T20:32:43.0906809Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.0907091Z 2025-05-07T20:32:43.0907171Z @given( 2025-05-07T20:32:43.0907404Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.0907725Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.0908031Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.0908363Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.0908695Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.0908979Z ) 2025-05-07T20:32:43.0909333Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.0910127Z def test_silu_mul_quant( 2025-05-07T20:32:43.0910379Z self, 2025-05-07T20:32:43.0910581Z T: int, 2025-05-07T20:32:43.0910785Z D: int, 2025-05-07T20:32:43.0911150Z scale_ub: Optional[float], 2025-05-07T20:32:43.0911431Z contiguous: bool, 2025-05-07T20:32:43.0911674Z compiled: bool, 2025-05-07T20:32:43.0911899Z ) -> None: 2025-05-07T20:32:43.0912119Z torch.manual_seed(2025) 2025-05-07T20:32:43.0912371Z 2025-05-07T20:32:43.0912652Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.0913014Z 2025-05-07T20:32:43.0913218Z x_sign = torch.sign(x) 2025-05-07T20:32:43.0913518Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.0913828Z x = x_sign * x_clamp 2025-05-07T20:32:43.0914085Z x0 = x[:, :D] 2025-05-07T20:32:43.0914317Z x1 = x[:, D:] 2025-05-07T20:32:43.0914523Z 2025-05-07T20:32:43.0914718Z if contiguous: 2025-05-07T20:32:43.0914960Z x0 = x0.contiguous() 2025-05-07T20:32:43.0915223Z x1 = x1.contiguous() 2025-05-07T20:32:43.0915473Z 2025-05-07T20:32:43.0915679Z if scale_ub is not None: 2025-05-07T20:32:43.0916070Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.0916410Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.0916728Z ) 2025-05-07T20:32:43.0916921Z else: 2025-05-07T20:32:43.0917137Z scale_ub_tensor = None 2025-05-07T20:32:43.0917394Z 2025-05-07T20:32:43.0917624Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.0917944Z op = silu_mul_quant 2025-05-07T20:32:43.0918200Z if compiled: 2025-05-07T20:32:43.0918449Z op = torch.compile(op) 2025-05-07T20:32:43.0918742Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.0919020Z 2025-05-07T20:32:43.0919218Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.0919386Z 2025-05-07T20:32:43.0919488Z moe/activation_test.py:117: 2025-05-07T20:32:43.0919790Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.0920129Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.0920417Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.0920982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.0921547Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.0922207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.0922891Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.0923429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.0924112Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.0924775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.0925309Z kernel = self.compile( 2025-05-07T20:32:43.0925850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.0926508Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.0926903Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.0927141Z 2025-05-07T20:32:43.0927351Z self = 2025-05-07T20:32:43.0928437Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.0929921Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea6c6d40>} 2025-05-07T20:32:43.0931265Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.0932377Z context = 2025-05-07T20:32:43.0932671Z 2025-05-07T20:32:43.0932838Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.0933362Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.0933833Z module_map=module_map) 2025-05-07T20:32:43.0934203Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.0934568Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.0934837Z E ^ 2025-05-07T20:32:43.0935303Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.0935765Z 2025-05-07T20:32:43.0936179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.0936694Z 2025-05-07T20:32:43.0936808Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.0937226Z self=, 2025-05-07T20:32:43.0937628Z T=128, 2025-05-07T20:32:43.0937825Z D=7168, 2025-05-07T20:32:43.0938027Z scale_ub=1200.0, 2025-05-07T20:32:43.0938253Z contiguous=False, 2025-05-07T20:32:43.0938492Z compiled=True, 2025-05-07T20:32:43.0938706Z ) 2025-05-07T20:32:43.1861402Z self = 2025-05-07T20:32:43.1862921Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.1863663Z 2025-05-07T20:32:43.1863827Z @given( 2025-05-07T20:32:43.1864329Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1864961Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1866023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1866703Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1867364Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1867927Z ) 2025-05-07T20:32:43.1868563Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1869062Z def test_silu_mul_quant( 2025-05-07T20:32:43.1869308Z self, 2025-05-07T20:32:43.1869501Z T: int, 2025-05-07T20:32:43.1869705Z D: int, 2025-05-07T20:32:43.1869929Z scale_ub: Optional[float], 2025-05-07T20:32:43.1870199Z contiguous: bool, 2025-05-07T20:32:43.1870448Z compiled: bool, 2025-05-07T20:32:43.1870678Z ) -> None: 2025-05-07T20:32:43.1870896Z torch.manual_seed(2025) 2025-05-07T20:32:43.1871153Z 2025-05-07T20:32:43.1871431Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1871775Z 2025-05-07T20:32:43.1871977Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1872279Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1872589Z x = x_sign * x_clamp 2025-05-07T20:32:43.1872836Z x0 = x[:, :D] 2025-05-07T20:32:43.1873058Z x1 = x[:, D:] 2025-05-07T20:32:43.1873264Z 2025-05-07T20:32:43.1873458Z if contiguous: 2025-05-07T20:32:43.1873695Z x0 = x0.contiguous() 2025-05-07T20:32:43.1873958Z x1 = x1.contiguous() 2025-05-07T20:32:43.1874198Z 2025-05-07T20:32:43.1874397Z if scale_ub is not None: 2025-05-07T20:32:43.1874683Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1875019Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1875339Z ) 2025-05-07T20:32:43.1875544Z else: 2025-05-07T20:32:43.1876133Z scale_ub_tensor = None 2025-05-07T20:32:43.1876399Z 2025-05-07T20:32:43.1876646Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1876964Z op = silu_mul_quant 2025-05-07T20:32:43.1877364Z if compiled: 2025-05-07T20:32:43.1877621Z op = torch.compile(op) 2025-05-07T20:32:43.1877922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1878208Z 2025-05-07T20:32:43.1878415Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1878581Z 2025-05-07T20:32:43.1878683Z moe/activation_test.py:117: 2025-05-07T20:32:43.1878990Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1879333Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1879625Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1880189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.1880768Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.1881433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1882126Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1882666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1883349Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1884018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1884549Z kernel = self.compile( 2025-05-07T20:32:43.1885092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1885749Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1886158Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1886390Z 2025-05-07T20:32:43.1886598Z self = 2025-05-07T20:32:43.1887683Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1889097Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c04ae2160>} 2025-05-07T20:32:43.1890449Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1891474Z context = 2025-05-07T20:32:43.1891771Z 2025-05-07T20:32:43.1891944Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1892475Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1892951Z module_map=module_map) 2025-05-07T20:32:43.1893326Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1893691Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1893954Z E ^ 2025-05-07T20:32:43.1894417Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1894876Z 2025-05-07T20:32:43.1895293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1895807Z 2025-05-07T20:32:43.1895923Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1896346Z self=, 2025-05-07T20:32:43.1896840Z T=2048, 2025-05-07T20:32:43.1897040Z D=7168, 2025-05-07T20:32:43.1897243Z scale_ub=None, 2025-05-07T20:32:43.1897461Z contiguous=True, 2025-05-07T20:32:43.1897799Z compiled=True, 2025-05-07T20:32:43.1898010Z ) 2025-05-07T20:32:43.1898329Z self = 2025-05-07T20:32:43.1898827Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.1899097Z 2025-05-07T20:32:43.1899186Z @given( 2025-05-07T20:32:43.1899419Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1899740Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1900054Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1900400Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1900733Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1901023Z ) 2025-05-07T20:32:43.1901388Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1901829Z def test_silu_mul_quant( 2025-05-07T20:32:43.1902081Z self, 2025-05-07T20:32:43.1902283Z T: int, 2025-05-07T20:32:43.1902490Z D: int, 2025-05-07T20:32:43.1902717Z scale_ub: Optional[float], 2025-05-07T20:32:43.1902999Z contiguous: bool, 2025-05-07T20:32:43.1903247Z compiled: bool, 2025-05-07T20:32:43.1903482Z ) -> None: 2025-05-07T20:32:43.1903708Z torch.manual_seed(2025) 2025-05-07T20:32:43.1903955Z 2025-05-07T20:32:43.1904233Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1904595Z 2025-05-07T20:32:43.1904789Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1905092Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1905425Z x = x_sign * x_clamp 2025-05-07T20:32:43.1905677Z x0 = x[:, :D] 2025-05-07T20:32:43.1905902Z x1 = x[:, D:] 2025-05-07T20:32:43.1906126Z 2025-05-07T20:32:43.1906318Z if contiguous: 2025-05-07T20:32:43.1906556Z x0 = x0.contiguous() 2025-05-07T20:32:43.1906828Z x1 = x1.contiguous() 2025-05-07T20:32:43.1907086Z 2025-05-07T20:32:43.1907281Z if scale_ub is not None: 2025-05-07T20:32:43.1907564Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1907916Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1908229Z ) 2025-05-07T20:32:43.1908439Z else: 2025-05-07T20:32:43.1908680Z scale_ub_tensor = None 2025-05-07T20:32:43.1908958Z 2025-05-07T20:32:43.1909200Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1909530Z op = silu_mul_quant 2025-05-07T20:32:43.1909781Z if compiled: 2025-05-07T20:32:43.1910045Z op = torch.compile(op) 2025-05-07T20:32:43.1910355Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1910635Z 2025-05-07T20:32:43.1910837Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1911013Z 2025-05-07T20:32:43.1911115Z moe/activation_test.py:117: 2025-05-07T20:32:43.1911416Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1911754Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1912047Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1912610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.1913170Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.1913852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1914553Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1915108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1915954Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1916630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1917171Z kernel = self.compile( 2025-05-07T20:32:43.1917784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1926378Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1926822Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1927062Z 2025-05-07T20:32:43.1927280Z self = 2025-05-07T20:32:43.1928373Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1929768Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8bdd18a0>} 2025-05-07T20:32:43.1931128Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1932171Z context = 2025-05-07T20:32:43.1932463Z 2025-05-07T20:32:43.1932644Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1933168Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1933646Z module_map=module_map) 2025-05-07T20:32:43.1934014Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1934379Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1934647Z E ^ 2025-05-07T20:32:43.1935122Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1935589Z 2025-05-07T20:32:43.1936010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1936522Z 2025-05-07T20:32:43.2538381Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2539588Z self=, 2025-05-07T20:32:43.2540802Z T=16384, 2025-05-07T20:32:43.2541351Z D=5120, 2025-05-07T20:32:43.2541891Z scale_ub=None, 2025-05-07T20:32:43.2542476Z contiguous=False, 2025-05-07T20:32:43.2543042Z compiled=False, 2025-05-07T20:32:43.2543465Z ) 2025-05-07T20:32:43.2544102Z self = 2025-05-07T20:32:43.2545112Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2545710Z 2025-05-07T20:32:43.2545872Z @given( 2025-05-07T20:32:43.2546339Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2546965Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2547601Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2548260Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2548762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2549091Z ) 2025-05-07T20:32:43.2549446Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2549895Z def test_silu_mul_quant( 2025-05-07T20:32:43.2550155Z self, 2025-05-07T20:32:43.2550363Z T: int, 2025-05-07T20:32:43.2550562Z D: int, 2025-05-07T20:32:43.2550792Z scale_ub: Optional[float], 2025-05-07T20:32:43.2551073Z contiguous: bool, 2025-05-07T20:32:43.2551322Z compiled: bool, 2025-05-07T20:32:43.2551551Z ) -> None: 2025-05-07T20:32:43.2552085Z torch.manual_seed(2025) 2025-05-07T20:32:43.2552345Z 2025-05-07T20:32:43.2552620Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2553102Z 2025-05-07T20:32:43.2553307Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2553600Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2555644Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
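The OOM report above already names the most direct mitigation: the caching-allocator hint PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of applying it, assuming the test process is launched from Python; the variable is read when the CUDA caching allocator initializes, so it must be set before any CUDA work:

    import os

    # Must happen before the first CUDA allocation in this process; setting it
    # before importing torch is the safest ordering.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # the allocator reads the variable lazily, on first CUDA use

Whether this rescues the job depends on the failure mode: expandable segments help when free memory is fragmented, not when the ~22 GiB device is genuinely exhausted.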
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2557649Z 2025-05-07T20:32:43.2557776Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.2557997Z 2025-05-07T20:32:43.2558107Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2558528Z self=, 2025-05-07T20:32:43.2558948Z T=4096, 2025-05-07T20:32:43.2559152Z D=7168, 2025-05-07T20:32:43.2559361Z scale_ub=1200.0, 2025-05-07T20:32:43.2559587Z contiguous=True, 2025-05-07T20:32:43.2559822Z compiled=True, 2025-05-07T20:32:43.2560045Z ) 2025-05-07T20:32:43.2560369Z self = 2025-05-07T20:32:43.2560878Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2561150Z 2025-05-07T20:32:43.2561239Z @given( 2025-05-07T20:32:43.2561471Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2561797Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2562116Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2562459Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2562788Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2563084Z ) 2025-05-07T20:32:43.2563447Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2563893Z def test_silu_mul_quant( 2025-05-07T20:32:43.2564146Z self, 2025-05-07T20:32:43.2564354Z T: int, 2025-05-07T20:32:43.2564558Z D: int, 2025-05-07T20:32:43.2564787Z scale_ub: Optional[float], 2025-05-07T20:32:43.2565065Z contiguous: bool, 2025-05-07T20:32:43.2565310Z compiled: bool, 2025-05-07T20:32:43.2565815Z ) -> None: 2025-05-07T20:32:43.2566038Z torch.manual_seed(2025) 2025-05-07T20:32:43.2566284Z 2025-05-07T20:32:43.2566569Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2566925Z 2025-05-07T20:32:43.2567136Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2567435Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2569512Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2571396Z 2025-05-07T20:32:43.2571515Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.2571730Z 2025-05-07T20:32:43.2571851Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2572263Z self=, 2025-05-07T20:32:43.2572801Z T=16384, 2025-05-07T20:32:43.2573014Z D=7168, 2025-05-07T20:32:43.2573217Z scale_ub=None, 2025-05-07T20:32:43.2573435Z contiguous=False, 2025-05-07T20:32:43.2573674Z compiled=False, 2025-05-07T20:32:43.2574003Z ) 2025-05-07T20:32:43.2574322Z self = 2025-05-07T20:32:43.2574835Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2575116Z 2025-05-07T20:32:43.2575206Z @given( 2025-05-07T20:32:43.2575440Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2575766Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2576084Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2576419Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2576758Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2577062Z ) 2025-05-07T20:32:43.2577433Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2577882Z def test_silu_mul_quant( 2025-05-07T20:32:43.2578137Z self, 2025-05-07T20:32:43.2578340Z T: int, 2025-05-07T20:32:43.2578547Z D: int, 2025-05-07T20:32:43.2578786Z scale_ub: Optional[float], 2025-05-07T20:32:43.2579069Z contiguous: bool, 2025-05-07T20:32:43.2579311Z compiled: bool, 2025-05-07T20:32:43.2579544Z ) -> None: 2025-05-07T20:32:43.2579789Z torch.manual_seed(2025) 2025-05-07T20:32:43.2580039Z 2025-05-07T20:32:43.2580315Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2582389Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
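The 448 MiB figure is exactly the tensor the failing line requests: a [T, 2 * D] bfloat16 buffer with T=16384 and D=7168. A quick check of that arithmetic:

    # x = torch.randn([T, 2 * D], dtype=torch.bfloat16), bfloat16 = 2 bytes/element
    T, D, BF16_BYTES = 16384, 7168, 2
    mib = T * (2 * D) * BF16_BYTES / 2**20
    print(mib)  # 448.0, matching "Tried to allocate 448.00 MiB"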
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2584275Z 2025-05-07T20:32:43.2584396Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2584609Z 2025-05-07T20:32:43.2584720Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2585134Z self=, 2025-05-07T20:32:43.2585544Z T=2048, 2025-05-07T20:32:43.2585738Z D=7168, 2025-05-07T20:32:43.2585938Z scale_ub=1200.0, 2025-05-07T20:32:43.2586158Z contiguous=True, 2025-05-07T20:32:43.2586390Z compiled=True, 2025-05-07T20:32:43.2586600Z ) 2025-05-07T20:32:43.2586921Z self = 2025-05-07T20:32:43.2587417Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2587694Z 2025-05-07T20:32:43.2587789Z @given( 2025-05-07T20:32:43.2588022Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2588347Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2588669Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2588996Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2589335Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2589628Z ) 2025-05-07T20:32:43.2589975Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2590427Z def test_silu_mul_quant( 2025-05-07T20:32:43.2590678Z self, 2025-05-07T20:32:43.2590880Z T: int, 2025-05-07T20:32:43.2591077Z D: int, 2025-05-07T20:32:43.2591300Z scale_ub: Optional[float], 2025-05-07T20:32:43.2591576Z contiguous: bool, 2025-05-07T20:32:43.2591816Z compiled: bool, 2025-05-07T20:32:43.2592044Z ) -> None: 2025-05-07T20:32:43.2592354Z torch.manual_seed(2025) 2025-05-07T20:32:43.2592597Z 2025-05-07T20:32:43.2592871Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2593216Z 2025-05-07T20:32:43.2593486Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2593778Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2595831Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2597683Z 2025-05-07T20:32:43.2597817Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.2598030Z 2025-05-07T20:32:43.2598141Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2598556Z self=, 2025-05-07T20:32:43.2598973Z T=2048, 2025-05-07T20:32:43.2599169Z D=7168, 2025-05-07T20:32:43.2599359Z scale_ub=None, 2025-05-07T20:32:43.2599577Z contiguous=True, 2025-05-07T20:32:43.2599803Z compiled=False, 2025-05-07T20:32:43.2600006Z ) 2025-05-07T20:32:43.3727341Z self = 2025-05-07T20:32:43.3728110Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.3728484Z 2025-05-07T20:32:43.3728600Z @given( 2025-05-07T20:32:43.3728830Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.3729150Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.3729464Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.3729832Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.3730161Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.3730463Z ) 2025-05-07T20:32:43.3730819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.3731279Z def test_silu_mul_quant( 2025-05-07T20:32:43.3731533Z self, 2025-05-07T20:32:43.3731738Z T: int, 2025-05-07T20:32:43.3731937Z D: int, 2025-05-07T20:32:43.3732160Z scale_ub: Optional[float], 2025-05-07T20:32:43.3732437Z contiguous: bool, 2025-05-07T20:32:43.3732678Z compiled: bool, 2025-05-07T20:32:43.3732910Z ) -> None: 2025-05-07T20:32:43.3733136Z torch.manual_seed(2025) 2025-05-07T20:32:43.3733381Z 2025-05-07T20:32:43.3733659Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.3734012Z 2025-05-07T20:32:43.3734219Z > x_sign = torch.sign(x) 2025-05-07T20:32:43.3736189Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
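Note that the "allocated by PyTorch" figure creeps upward across consecutive examples (21.60 GiB at the first OOM above, 21.67 GiB here), which suggests tensors from earlier hypothesis examples stay live while the next one runs. A hypothetical per-example cleanup helper (the name is not from this codebase) that a test could call between examples:

    import gc

    import torch

    def release_cuda_memory() -> None:
        """Best-effort memory reclaim between test examples."""
        gc.collect()              # drop tensors kept alive only by dead frames
        torch.cuda.empty_cache()  # hand cached allocator blocks back to the driver
        torch.cuda.synchronize()  # ensure frees complete before the next case

empty_cache() alone cannot release blocks that are still referenced, so the gc.collect() pass matters when exception tracebacks pin old frames.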
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.3738067Z 2025-05-07T20:32:43.3738185Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:43.3738404Z 2025-05-07T20:32:43.3738512Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.3738985Z self=, 2025-05-07T20:32:43.3739385Z T=1, 2025-05-07T20:32:43.3739577Z D=7168, 2025-05-07T20:32:43.3739773Z scale_ub=1200.0, 2025-05-07T20:32:43.3740300Z contiguous=True, 2025-05-07T20:32:43.3740523Z compiled=False, 2025-05-07T20:32:43.3740731Z ) 2025-05-07T20:32:43.3741054Z self = 2025-05-07T20:32:43.3741700Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.3741970Z 2025-05-07T20:32:43.3742050Z @given( 2025-05-07T20:32:43.3742287Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.3742597Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.3742907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.3743240Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.3743565Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.3743858Z ) 2025-05-07T20:32:43.3744209Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.3744653Z def test_silu_mul_quant( 2025-05-07T20:32:43.3744897Z self, 2025-05-07T20:32:43.3745103Z T: int, 2025-05-07T20:32:43.3745307Z D: int, 2025-05-07T20:32:43.3745526Z scale_ub: Optional[float], 2025-05-07T20:32:43.3745800Z contiguous: bool, 2025-05-07T20:32:43.3746049Z compiled: bool, 2025-05-07T20:32:43.3746269Z ) -> None: 2025-05-07T20:32:43.3746487Z torch.manual_seed(2025) 2025-05-07T20:32:43.3746732Z 2025-05-07T20:32:43.3747001Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.3747346Z 2025-05-07T20:32:43.3747546Z x_sign = torch.sign(x) 2025-05-07T20:32:43.3747837Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.3748155Z x = x_sign * x_clamp 2025-05-07T20:32:43.3748405Z x0 = x[:, :D] 2025-05-07T20:32:43.3748622Z x1 = x[:, D:] 2025-05-07T20:32:43.3748838Z 2025-05-07T20:32:43.3749032Z if contiguous: 2025-05-07T20:32:43.3749273Z x0 = x0.contiguous() 2025-05-07T20:32:43.3749545Z x1 = x1.contiguous() 2025-05-07T20:32:43.3749792Z 2025-05-07T20:32:43.3750000Z if scale_ub is not None: 2025-05-07T20:32:43.3750277Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.3750623Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.3750944Z ) 2025-05-07T20:32:43.3751141Z else: 2025-05-07T20:32:43.3751360Z scale_ub_tensor = None 2025-05-07T20:32:43.3751618Z 2025-05-07T20:32:43.3751857Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.3752179Z op = silu_mul_quant 2025-05-07T20:32:43.3752433Z if compiled: 2025-05-07T20:32:43.3752679Z op = torch.compile(op) 2025-05-07T20:32:43.3752978Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.3753256Z 2025-05-07T20:32:43.3753449Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.3753618Z 2025-05-07T20:32:43.3753720Z moe/activation_test.py:117: 2025-05-07T20:32:43.3754025Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.3754362Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.3754648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.3755354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.3756150Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.3756685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.3757372Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.3758041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.3758582Z kernel = self.compile( 2025-05-07T20:32:43.3759207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.3759873Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.3760280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.3760593Z 2025-05-07T20:32:43.3760809Z self = 2025-05-07T20:32:43.3761890Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.3763263Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b260f40>} 2025-05-07T20:32:43.3764614Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.3765926Z context = 2025-05-07T20:32:43.3766342Z 2025-05-07T20:32:43.3766519Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.3767050Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.3767528Z module_map=module_map) 2025-05-07T20:32:43.3767895Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.3768252Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.3768521Z E ^ 2025-05-07T20:32:43.3768995Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.3769449Z 2025-05-07T20:32:43.3769869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.3770395Z 2025-05-07T20:32:43.3770501Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.3770923Z self=, 2025-05-07T20:32:43.3771340Z T=128, 2025-05-07T20:32:43.3771532Z D=5120, 2025-05-07T20:32:43.3771734Z scale_ub=None, 2025-05-07T20:32:43.3771952Z contiguous=True, 2025-05-07T20:32:43.3772179Z compiled=False, 2025-05-07T20:32:43.3772393Z ) 2025-05-07T20:32:43.4451491Z self = 2025-05-07T20:32:43.4452202Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.4452578Z 2025-05-07T20:32:43.4452685Z @given( 2025-05-07T20:32:43.4452990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.4453405Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.4453799Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.4454212Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.4454549Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.4454834Z ) 2025-05-07T20:32:43.4455192Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.4455648Z def test_silu_mul_quant( 2025-05-07T20:32:43.4455891Z self, 2025-05-07T20:32:43.4456083Z T: int, 2025-05-07T20:32:43.4456285Z D: int, 2025-05-07T20:32:43.4456509Z scale_ub: Optional[float], 2025-05-07T20:32:43.4456781Z contiguous: bool, 2025-05-07T20:32:43.4457025Z compiled: bool, 2025-05-07T20:32:43.4457253Z ) -> None: 2025-05-07T20:32:43.4457468Z torch.manual_seed(2025) 2025-05-07T20:32:43.4457717Z 2025-05-07T20:32:43.4457991Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.4458336Z 2025-05-07T20:32:43.4458532Z x_sign = torch.sign(x) 2025-05-07T20:32:43.4458852Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.4459480Z x = x_sign * x_clamp 2025-05-07T20:32:43.4459731Z x0 = x[:, :D] 2025-05-07T20:32:43.4459943Z x1 = x[:, D:] 2025-05-07T20:32:43.4460160Z 2025-05-07T20:32:43.4460487Z if contiguous: 2025-05-07T20:32:43.4460720Z x0 = x0.contiguous() 2025-05-07T20:32:43.4460991Z x1 = x1.contiguous() 2025-05-07T20:32:43.4461238Z 2025-05-07T20:32:43.4461428Z if scale_ub is not None: 2025-05-07T20:32:43.4461710Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.4462052Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.4462357Z ) 2025-05-07T20:32:43.4462556Z else: 2025-05-07T20:32:43.4462770Z scale_ub_tensor = None 2025-05-07T20:32:43.4463021Z 2025-05-07T20:32:43.4463256Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.4463574Z op = silu_mul_quant 2025-05-07T20:32:43.4463826Z if compiled: 2025-05-07T20:32:43.4464092Z op = torch.compile(op) 2025-05-07T20:32:43.4464390Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.4464668Z 2025-05-07T20:32:43.4464860Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.4465039Z 2025-05-07T20:32:43.4465141Z moe/activation_test.py:117: 2025-05-07T20:32:43.4465742Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.4466081Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.4466365Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.4467055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.4467737Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.4468278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.4468964Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.4469631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.4470163Z kernel = self.compile( 2025-05-07T20:32:43.4470710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.4471369Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.4471768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.4472001Z 2025-05-07T20:32:43.4472210Z self = 2025-05-07T20:32:43.4473296Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.4474693Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b262020>} 2025-05-07T20:32:43.4476101Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.4477130Z context = 2025-05-07T20:32:43.4477425Z 2025-05-07T20:32:43.4477593Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.4478126Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.4478610Z module_map=module_map) 2025-05-07T20:32:43.4478977Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.4479347Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.4479620Z E ^ 2025-05-07T20:32:43.4480214Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.4480673Z 2025-05-07T20:32:43.4481087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.4481714Z 2025-05-07T20:32:43.4481823Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.4482242Z self=, 2025-05-07T20:32:43.4482650Z T=128, 2025-05-07T20:32:43.4482853Z D=7168, 2025-05-07T20:32:43.4483066Z scale_ub=None, 2025-05-07T20:32:43.4483279Z contiguous=True, 2025-05-07T20:32:43.4483509Z compiled=False, 2025-05-07T20:32:43.4483723Z ) 2025-05-07T20:32:43.4484039Z self = 2025-05-07T20:32:43.4484529Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.4484807Z 2025-05-07T20:32:43.4484892Z @given( 2025-05-07T20:32:43.4485124Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.4485434Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.4485752Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.4486088Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.4486411Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.4486703Z ) 2025-05-07T20:32:43.4487055Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.4487496Z def test_silu_mul_quant( 2025-05-07T20:32:43.4487745Z self, 2025-05-07T20:32:43.4487949Z T: int, 2025-05-07T20:32:43.4488151Z D: int, 2025-05-07T20:32:43.4488367Z scale_ub: Optional[float], 2025-05-07T20:32:43.4488645Z contiguous: bool, 2025-05-07T20:32:43.4488887Z compiled: bool, 2025-05-07T20:32:43.4489121Z ) -> None: 2025-05-07T20:32:43.4489352Z torch.manual_seed(2025) 2025-05-07T20:32:43.4489600Z 2025-05-07T20:32:43.4489875Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.4490233Z 2025-05-07T20:32:43.4490449Z x_sign = torch.sign(x) 2025-05-07T20:32:43.4490742Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.4499582Z x = x_sign * x_clamp 2025-05-07T20:32:43.4499848Z x0 = x[:, :D] 2025-05-07T20:32:43.4500079Z x1 = x[:, D:] 2025-05-07T20:32:43.4500293Z 2025-05-07T20:32:43.4500492Z if contiguous: 2025-05-07T20:32:43.4500734Z x0 = x0.contiguous() 2025-05-07T20:32:43.4500996Z x1 = x1.contiguous() 2025-05-07T20:32:43.4501243Z 2025-05-07T20:32:43.4501444Z if scale_ub is not None: 2025-05-07T20:32:43.4501728Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.4502078Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.4502391Z ) 2025-05-07T20:32:43.4502609Z else: 2025-05-07T20:32:43.4502833Z scale_ub_tensor = None 2025-05-07T20:32:43.4503079Z 2025-05-07T20:32:43.4503315Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.4503648Z op = silu_mul_quant 2025-05-07T20:32:43.4503911Z if compiled: 2025-05-07T20:32:43.4504171Z op = torch.compile(op) 2025-05-07T20:32:43.4504477Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.4504759Z 2025-05-07T20:32:43.4504968Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.4505136Z 2025-05-07T20:32:43.4505250Z moe/activation_test.py:117: 2025-05-07T20:32:43.4505549Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.4505899Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.4506192Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.4507011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.4507708Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.4508254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.4509021Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.4509686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.4510228Z kernel = self.compile( 2025-05-07T20:32:43.4510776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.4511441Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.4511842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.4512083Z 2025-05-07T20:32:43.4512299Z self = 2025-05-07T20:32:43.4513390Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.4514780Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b262f20>} 2025-05-07T20:32:43.4516222Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.4517249Z context = 2025-05-07T20:32:43.4517549Z 2025-05-07T20:32:43.4517718Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.4518252Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.4518725Z module_map=module_map) 2025-05-07T20:32:43.4519095Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.4519466Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.4519733Z E ^ 2025-05-07T20:32:43.4520198Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.4520662Z 2025-05-07T20:32:43.4521078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.4521588Z 2025-05-07T20:32:43.4521701Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.4522121Z self=, 2025-05-07T20:32:43.4522525Z T=2048, 2025-05-07T20:32:43.4522729Z D=7168, 2025-05-07T20:32:43.4522932Z scale_ub=1200.0, 2025-05-07T20:32:43.4523162Z contiguous=True, 2025-05-07T20:32:43.4523397Z compiled=False, 2025-05-07T20:32:43.4523612Z ) 2025-05-07T20:32:43.5352512Z self = 2025-05-07T20:32:43.5353250Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.5353661Z 2025-05-07T20:32:43.5353785Z @given( 2025-05-07T20:32:43.5354105Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.5354536Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.5354862Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.5355198Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.5355527Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.5355894Z ) 2025-05-07T20:32:43.5356249Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.5356693Z def test_silu_mul_quant( 2025-05-07T20:32:43.5357169Z self, 2025-05-07T20:32:43.5357377Z T: int, 2025-05-07T20:32:43.5357578Z D: int, 2025-05-07T20:32:43.5357806Z scale_ub: Optional[float], 2025-05-07T20:32:43.5358123Z contiguous: bool, 2025-05-07T20:32:43.5358506Z compiled: bool, 2025-05-07T20:32:43.5358744Z ) -> None: 2025-05-07T20:32:43.5358992Z torch.manual_seed(2025) 2025-05-07T20:32:43.5359271Z 2025-05-07T20:32:43.5359549Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.5361632Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.5363490Z 2025-05-07T20:32:43.5363620Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.5363840Z 2025-05-07T20:32:43.5363948Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.5364371Z self=, 2025-05-07T20:32:43.5364788Z T=1, 2025-05-07T20:32:43.5364974Z D=5120, 2025-05-07T20:32:43.5365173Z scale_ub=1200.0, 2025-05-07T20:32:43.5365676Z contiguous=True, 2025-05-07T20:32:43.5365907Z compiled=False, 2025-05-07T20:32:43.5366120Z ) 2025-05-07T20:32:43.5366446Z self = 2025-05-07T20:32:43.5366928Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.5367198Z 2025-05-07T20:32:43.5367276Z @given( 2025-05-07T20:32:43.5367511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.5367838Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.5368147Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.5368480Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.5368815Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.5369100Z ) 2025-05-07T20:32:43.5369456Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.5369901Z def test_silu_mul_quant( 2025-05-07T20:32:43.5370146Z self, 2025-05-07T20:32:43.5370352Z T: int, 2025-05-07T20:32:43.5370559Z D: int, 2025-05-07T20:32:43.5370782Z scale_ub: Optional[float], 2025-05-07T20:32:43.5371065Z contiguous: bool, 2025-05-07T20:32:43.5371310Z compiled: bool, 2025-05-07T20:32:43.5371546Z ) -> None: 2025-05-07T20:32:43.5371762Z torch.manual_seed(2025) 2025-05-07T20:32:43.5372011Z 2025-05-07T20:32:43.5372299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.5372641Z 2025-05-07T20:32:43.5372848Z x_sign = torch.sign(x) 2025-05-07T20:32:43.5373148Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.5373460Z x = x_sign * x_clamp 2025-05-07T20:32:43.5373709Z x0 = x[:, :D] 2025-05-07T20:32:43.5373936Z x1 = x[:, D:] 2025-05-07T20:32:43.5374150Z 2025-05-07T20:32:43.5374351Z if contiguous: 2025-05-07T20:32:43.5374590Z x0 = x0.contiguous() 2025-05-07T20:32:43.5374850Z x1 = x1.contiguous() 2025-05-07T20:32:43.5375105Z 2025-05-07T20:32:43.5375317Z if scale_ub is not None: 2025-05-07T20:32:43.5375592Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.5375932Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.5376247Z ) 2025-05-07T20:32:43.5376452Z else: 2025-05-07T20:32:43.5376677Z scale_ub_tensor = None 2025-05-07T20:32:43.5376929Z 2025-05-07T20:32:43.5377298Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.5377622Z op = silu_mul_quant 2025-05-07T20:32:43.5377882Z if compiled: 2025-05-07T20:32:43.5378235Z op = torch.compile(op) 2025-05-07T20:32:43.5378540Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5378822Z 2025-05-07T20:32:43.5379016Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.5379187Z 2025-05-07T20:32:43.5379286Z moe/activation_test.py:117: 2025-05-07T20:32:43.5379584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5379918Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.5380204Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5380901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.5381589Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.5382131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.5382812Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.5383482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.5384015Z kernel = self.compile( 2025-05-07T20:32:43.5384563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.5385217Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.5385617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5385847Z 2025-05-07T20:32:43.5386054Z self = 2025-05-07T20:32:43.5387140Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.5388510Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b1004a0>} 2025-05-07T20:32:43.5389911Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.5390934Z context = 2025-05-07T20:32:43.5391222Z 2025-05-07T20:32:43.5391387Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.5391915Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.5392395Z module_map=module_map) 2025-05-07T20:32:43.5392755Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.5393111Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.5393380Z E ^ 2025-05-07T20:32:43.5393851Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.5394303Z 2025-05-07T20:32:43.5394713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.5395226Z 2025-05-07T20:32:43.5395331Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.5395790Z self=, 2025-05-07T20:32:43.5396194Z T=2048, 2025-05-07T20:32:43.5396394Z D=5120, 2025-05-07T20:32:43.5396591Z scale_ub=None, 2025-05-07T20:32:43.5396803Z contiguous=True, 2025-05-07T20:32:43.5397037Z compiled=False, 2025-05-07T20:32:43.5397248Z ) 2025-05-07T20:32:43.5397656Z self = 2025-05-07T20:32:43.5398154Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.5398590Z 2025-05-07T20:32:43.5398672Z @given( 2025-05-07T20:32:43.5398911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.5399220Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.5399537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.5399870Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.5400204Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.5400499Z ) 2025-05-07T20:32:43.5400853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.5401306Z def test_silu_mul_quant( 2025-05-07T20:32:43.5401549Z self, 2025-05-07T20:32:43.5401750Z T: int, 2025-05-07T20:32:43.5401958Z D: int, 2025-05-07T20:32:43.5402181Z scale_ub: Optional[float], 2025-05-07T20:32:43.5402458Z contiguous: bool, 2025-05-07T20:32:43.5402703Z compiled: bool, 2025-05-07T20:32:43.5402926Z ) -> None: 2025-05-07T20:32:43.5403157Z torch.manual_seed(2025) 2025-05-07T20:32:43.5403406Z 2025-05-07T20:32:43.5403675Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.5404021Z 2025-05-07T20:32:43.5404225Z > x_sign = torch.sign(x) 2025-05-07T20:32:43.5406184Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
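The other recurring failure above is deterministic: Triton rejects fp8e4nv (PyTorch's float8_e4m3fn) at kernel compile time on this GPU and lists only fp8e4b15 and fp8e5 as available. That is consistent with fp8e4nv generally requiring compute capability 8.9 (Ada) or newer. A hedged sketch of a skip guard a test module could use; the decorator name is hypothetical, and the (8, 9) threshold is an assumption matching the error text:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumes fp8e4nv needs SM 8.9+ (Ada/Hopper), consistent with the
        # CompilationError in this log.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8e4nv = unittest.skipUnless(
        supports_fp8e4nv(), "fp8e4nv unsupported on this GPU architecture"
    )

Guarded this way, these examples would report as skips on pre-Ada runners instead of tripping Triton's frontend on every draw.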
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.5408040Z 2025-05-07T20:32:43.5408165Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:43.5408376Z 2025-05-07T20:32:43.5408484Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.5408901Z self=, 2025-05-07T20:32:43.5409305Z T=16384, 2025-05-07T20:32:43.5409500Z D=5120, 2025-05-07T20:32:43.5409698Z scale_ub=None, 2025-05-07T20:32:43.5409914Z contiguous=True, 2025-05-07T20:32:43.5410134Z compiled=False, 2025-05-07T20:32:43.5410342Z ) 2025-05-07T20:32:43.6174082Z self = 2025-05-07T20:32:43.6174604Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.6174924Z 2025-05-07T20:32:43.6175034Z @given( 2025-05-07T20:32:43.6175364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.6175833Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.6176252Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.6176706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.6177105Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.6177392Z ) 2025-05-07T20:32:43.6177745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.6178192Z def test_silu_mul_quant( 2025-05-07T20:32:43.6178434Z self, 2025-05-07T20:32:43.6178640Z T: int, 2025-05-07T20:32:43.6178841Z D: int, 2025-05-07T20:32:43.6179085Z scale_ub: Optional[float], 2025-05-07T20:32:43.6179387Z contiguous: bool, 2025-05-07T20:32:43.6179633Z compiled: bool, 2025-05-07T20:32:43.6179868Z ) -> None: 2025-05-07T20:32:43.6180083Z torch.manual_seed(2025) 2025-05-07T20:32:43.6180329Z 2025-05-07T20:32:43.6180610Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.6182837Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.6184820Z 2025-05-07T20:32:43.6184941Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.6185160Z 2025-05-07T20:32:43.6185266Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.6185694Z self=, 2025-05-07T20:32:43.6186106Z T=4096, 2025-05-07T20:32:43.6186300Z D=5120, 2025-05-07T20:32:43.6186510Z scale_ub=None, 2025-05-07T20:32:43.6186729Z contiguous=True, 2025-05-07T20:32:43.6186954Z compiled=False, 2025-05-07T20:32:43.6187166Z ) 2025-05-07T20:32:43.6187489Z self = 2025-05-07T20:32:43.6187988Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.6188270Z 2025-05-07T20:32:43.6188352Z @given( 2025-05-07T20:32:43.6188589Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.6188903Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.6189222Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.6189562Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.6189903Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.6190196Z ) 2025-05-07T20:32:43.6190547Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.6190993Z def test_silu_mul_quant( 2025-05-07T20:32:43.6191239Z self, 2025-05-07T20:32:43.6191438Z T: int, 2025-05-07T20:32:43.6191640Z D: int, 2025-05-07T20:32:43.6191860Z scale_ub: Optional[float], 2025-05-07T20:32:43.6192141Z contiguous: bool, 2025-05-07T20:32:43.6192386Z compiled: bool, 2025-05-07T20:32:43.6192609Z ) -> None: 2025-05-07T20:32:43.6192834Z torch.manual_seed(2025) 2025-05-07T20:32:43.6193081Z 2025-05-07T20:32:43.6193355Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.6195406Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.6197351Z 2025-05-07T20:32:43.6197470Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.6197692Z 2025-05-07T20:32:43.6197796Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.6198216Z self=, 2025-05-07T20:32:43.6198621Z T=2048, 2025-05-07T20:32:43.6198817Z D=5120, 2025-05-07T20:32:43.6199019Z scale_ub=None, 2025-05-07T20:32:43.6199232Z contiguous=False, 2025-05-07T20:32:43.6199460Z compiled=False, 2025-05-07T20:32:43.6199668Z ) 2025-05-07T20:32:43.6199991Z self = 2025-05-07T20:32:43.6200494Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.6200797Z 2025-05-07T20:32:43.6200878Z @given( 2025-05-07T20:32:43.6201203Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.6201524Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.6201829Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.6202236Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.6202572Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.6202857Z ) 2025-05-07T20:32:43.6203218Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.6203673Z def test_silu_mul_quant( 2025-05-07T20:32:43.6203915Z self, 2025-05-07T20:32:43.6204118Z T: int, 2025-05-07T20:32:43.6204325Z D: int, 2025-05-07T20:32:43.6204547Z scale_ub: Optional[float], 2025-05-07T20:32:43.6204825Z contiguous: bool, 2025-05-07T20:32:43.6205077Z compiled: bool, 2025-05-07T20:32:43.6205307Z ) -> None: 2025-05-07T20:32:43.6205525Z torch.manual_seed(2025) 2025-05-07T20:32:43.6205784Z 2025-05-07T20:32:43.6206084Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.6208137Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.6209996Z 2025-05-07T20:32:43.6210129Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.6210345Z 2025-05-07T20:32:43.6210456Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.6210885Z self=, 2025-05-07T20:32:43.6211299Z T=4096, 2025-05-07T20:32:43.6211499Z D=7168, 2025-05-07T20:32:43.6211700Z scale_ub=None, 2025-05-07T20:32:43.6211923Z contiguous=True, 2025-05-07T20:32:43.6212146Z compiled=True, 2025-05-07T20:32:43.6212366Z ) 2025-05-07T20:32:43.6212691Z self = 2025-05-07T20:32:43.6213178Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.6213454Z 2025-05-07T20:32:43.6213535Z @given( 2025-05-07T20:32:43.6213773Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.6214099Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.6214407Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.6214747Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.6215082Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.6215373Z ) 2025-05-07T20:32:43.6215739Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.6216191Z def test_silu_mul_quant( 2025-05-07T20:32:43.6216434Z self, 2025-05-07T20:32:43.6216633Z T: int, 2025-05-07T20:32:43.6216833Z D: int, 2025-05-07T20:32:43.6217055Z scale_ub: Optional[float], 2025-05-07T20:32:43.6217336Z contiguous: bool, 2025-05-07T20:32:43.6217585Z compiled: bool, 2025-05-07T20:32:43.6217810Z ) -> None: 2025-05-07T20:32:43.6218035Z torch.manual_seed(2025) 2025-05-07T20:32:43.6218281Z 2025-05-07T20:32:43.6218558Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.6220695Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.6222641Z 2025-05-07T20:32:43.6222764Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.6222988Z 2025-05-07T20:32:43.6223098Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.6223523Z self=, 2025-05-07T20:32:43.6223944Z T=2048, 2025-05-07T20:32:43.6224137Z D=5120, 2025-05-07T20:32:43.6224341Z scale_ub=1200.0, 2025-05-07T20:32:43.6224579Z contiguous=False, 2025-05-07T20:32:43.6224812Z compiled=False, 2025-05-07T20:32:43.6225031Z ) 2025-05-07T20:32:43.6225361Z self = 2025-05-07T20:32:43.6225859Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.6226149Z 2025-05-07T20:32:43.6226238Z @given( 2025-05-07T20:32:43.6226483Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.6226803Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.6227130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.6227486Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.6227829Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.6228125Z ) 2025-05-07T20:32:43.6228492Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.6228958Z def test_silu_mul_quant( 2025-05-07T20:32:43.6229209Z self, 2025-05-07T20:32:43.6229425Z T: int, 2025-05-07T20:32:43.6229634Z D: int, 2025-05-07T20:32:43.6229857Z scale_ub: Optional[float], 2025-05-07T20:32:43.6230154Z contiguous: bool, 2025-05-07T20:32:43.6230417Z compiled: bool, 2025-05-07T20:32:43.6230646Z ) -> None: 2025-05-07T20:32:43.6230888Z torch.manual_seed(2025) 2025-05-07T20:32:43.6231149Z 2025-05-07T20:32:43.6231427Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.6233489Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.6235361Z 2025-05-07T20:32:43.6235482Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.6235756Z 2025-05-07T20:32:43.6235866Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.6236294Z self=, 2025-05-07T20:32:43.6236706Z T=4096, 2025-05-07T20:32:43.6236913Z D=7168, 2025-05-07T20:32:43.6237127Z scale_ub=1200.0, 2025-05-07T20:32:43.6237351Z contiguous=True, 2025-05-07T20:32:43.6237611Z compiled=False, 2025-05-07T20:32:43.6237830Z ) 2025-05-07T20:32:43.7294688Z self = 2025-05-07T20:32:43.7295244Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.7295533Z 2025-05-07T20:32:43.7295618Z @given( 2025-05-07T20:32:43.7295857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7296179Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7296501Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7305060Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7305418Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7305717Z ) 2025-05-07T20:32:43.7306251Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7306711Z def test_silu_mul_quant( 2025-05-07T20:32:43.7306957Z self, 2025-05-07T20:32:43.7307276Z T: int, 2025-05-07T20:32:43.7307476Z D: int, 2025-05-07T20:32:43.7307701Z scale_ub: Optional[float], 2025-05-07T20:32:43.7307980Z contiguous: bool, 2025-05-07T20:32:43.7308222Z compiled: bool, 2025-05-07T20:32:43.7308460Z ) -> None: 2025-05-07T20:32:43.7308684Z torch.manual_seed(2025) 2025-05-07T20:32:43.7308952Z 2025-05-07T20:32:43.7309260Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7311334Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.7313210Z 2025-05-07T20:32:43.7313340Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.7313587Z 2025-05-07T20:32:43.7313700Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7314128Z self=, 2025-05-07T20:32:43.7314535Z T=16384, 2025-05-07T20:32:43.7314739Z D=7168, 2025-05-07T20:32:43.7314946Z scale_ub=None, 2025-05-07T20:32:43.7315165Z contiguous=False, 2025-05-07T20:32:43.7315401Z compiled=True, 2025-05-07T20:32:43.7315615Z ) 2025-05-07T20:32:43.7316005Z self = 2025-05-07T20:32:43.7316509Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.7316799Z 2025-05-07T20:32:43.7316886Z @given( 2025-05-07T20:32:43.7317132Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7317452Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7317772Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7318112Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7318444Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7318738Z ) 2025-05-07T20:32:43.7319092Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7319552Z def test_silu_mul_quant( 2025-05-07T20:32:43.7319803Z self, 2025-05-07T20:32:43.7320006Z T: int, 2025-05-07T20:32:43.7320218Z D: int, 2025-05-07T20:32:43.7320442Z scale_ub: Optional[float], 2025-05-07T20:32:43.7320720Z contiguous: bool, 2025-05-07T20:32:43.7320972Z compiled: bool, 2025-05-07T20:32:43.7321203Z ) -> None: 2025-05-07T20:32:43.7321430Z torch.manual_seed(2025) 2025-05-07T20:32:43.7321681Z 2025-05-07T20:32:43.7321951Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7324017Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.7325886Z 2025-05-07T20:32:43.7326006Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.7326224Z 2025-05-07T20:32:43.7326416Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7326839Z self=, 2025-05-07T20:32:43.7327240Z T=4096, 2025-05-07T20:32:43.7327440Z D=7168, 2025-05-07T20:32:43.7327714Z scale_ub=None, 2025-05-07T20:32:43.7327930Z contiguous=True, 2025-05-07T20:32:43.7328158Z compiled=False, 2025-05-07T20:32:43.7328374Z ) 2025-05-07T20:32:43.7328691Z self = 2025-05-07T20:32:43.7329192Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.7329462Z 2025-05-07T20:32:43.7329546Z @given( 2025-05-07T20:32:43.7329787Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7330111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7330424Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7330767Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7331102Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7331401Z ) 2025-05-07T20:32:43.7331760Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7332211Z def test_silu_mul_quant( 2025-05-07T20:32:43.7332460Z self, 2025-05-07T20:32:43.7332660Z T: int, 2025-05-07T20:32:43.7332861Z D: int, 2025-05-07T20:32:43.7333087Z scale_ub: Optional[float], 2025-05-07T20:32:43.7333364Z contiguous: bool, 2025-05-07T20:32:43.7333611Z compiled: bool, 2025-05-07T20:32:43.7333847Z ) -> None: 2025-05-07T20:32:43.7334067Z torch.manual_seed(2025) 2025-05-07T20:32:43.7334315Z 2025-05-07T20:32:43.7334593Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7336655Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.7338530Z 2025-05-07T20:32:43.7338653Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.7338869Z 2025-05-07T20:32:43.7338986Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7339400Z self=, 2025-05-07T20:32:43.7339808Z T=16384, 2025-05-07T20:32:43.7340011Z D=7168, 2025-05-07T20:32:43.7340202Z scale_ub=None, 2025-05-07T20:32:43.7340423Z contiguous=True, 2025-05-07T20:32:43.7340652Z compiled=False, 2025-05-07T20:32:43.7340860Z ) 2025-05-07T20:32:43.7341181Z self = 2025-05-07T20:32:43.7341688Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.7341966Z 2025-05-07T20:32:43.7342051Z @given( 2025-05-07T20:32:43.7342283Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7342604Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7342921Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7343251Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7343586Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7343878Z ) 2025-05-07T20:32:43.7344232Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7344676Z def test_silu_mul_quant( 2025-05-07T20:32:43.7344921Z self, 2025-05-07T20:32:43.7345121Z T: int, 2025-05-07T20:32:43.7345317Z D: int, 2025-05-07T20:32:43.7345540Z scale_ub: Optional[float], 2025-05-07T20:32:43.7345905Z contiguous: bool, 2025-05-07T20:32:43.7346151Z compiled: bool, 2025-05-07T20:32:43.7346386Z ) -> None: 2025-05-07T20:32:43.7346608Z torch.manual_seed(2025) 2025-05-07T20:32:43.7346852Z 2025-05-07T20:32:43.7347203Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7349318Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.7351191Z 2025-05-07T20:32:43.7351316Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.7351544Z 2025-05-07T20:32:43.7351659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7352081Z self=, 2025-05-07T20:32:43.7352497Z T=16384, 2025-05-07T20:32:43.7352707Z D=7168, 2025-05-07T20:32:43.7352904Z scale_ub=1200.0, 2025-05-07T20:32:43.7353135Z contiguous=True, 2025-05-07T20:32:43.7353371Z compiled=False, 2025-05-07T20:32:43.7353581Z ) 2025-05-07T20:32:43.7353910Z self = 2025-05-07T20:32:43.7354423Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.7354704Z 2025-05-07T20:32:43.7354795Z @given( 2025-05-07T20:32:43.7355031Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7355362Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7355675Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7356083Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7356428Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7356713Z ) 2025-05-07T20:32:43.7357063Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7357523Z def test_silu_mul_quant( 2025-05-07T20:32:43.7357769Z self, 2025-05-07T20:32:43.7357969Z T: int, 2025-05-07T20:32:43.7358172Z D: int, 2025-05-07T20:32:43.7358390Z scale_ub: Optional[float], 2025-05-07T20:32:43.7358664Z contiguous: bool, 2025-05-07T20:32:43.7358920Z compiled: bool, 2025-05-07T20:32:43.7359168Z ) -> None: 2025-05-07T20:32:43.7359413Z torch.manual_seed(2025) 2025-05-07T20:32:43.7359666Z 2025-05-07T20:32:43.7359935Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7361998Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.7363867Z 2025-05-07T20:32:43.7363991Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.7364213Z 2025-05-07T20:32:43.7364319Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7364737Z self=, 2025-05-07T20:32:43.7365137Z T=128, 2025-05-07T20:32:43.7365335Z D=5120, 2025-05-07T20:32:43.7365816Z scale_ub=1200.0, 2025-05-07T20:32:43.7366040Z contiguous=False, 2025-05-07T20:32:43.7366270Z compiled=False, 2025-05-07T20:32:43.7366481Z ) 2025-05-07T20:32:43.8632236Z self = 2025-05-07T20:32:43.8632788Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.8633206Z 2025-05-07T20:32:43.8633285Z @given( 2025-05-07T20:32:43.8633517Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8633831Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8634133Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8634462Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8634792Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8635074Z ) 2025-05-07T20:32:43.8635422Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8635937Z def test_silu_mul_quant( 2025-05-07T20:32:43.8636182Z self, 2025-05-07T20:32:43.8636375Z T: int, 2025-05-07T20:32:43.8636577Z D: int, 2025-05-07T20:32:43.8636802Z scale_ub: Optional[float], 2025-05-07T20:32:43.8637071Z contiguous: bool, 2025-05-07T20:32:43.8637313Z compiled: bool, 2025-05-07T20:32:43.8637544Z ) -> None: 2025-05-07T20:32:43.8637767Z torch.manual_seed(2025) 2025-05-07T20:32:43.8638013Z 2025-05-07T20:32:43.8638289Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8638627Z 2025-05-07T20:32:43.8638835Z x_sign = torch.sign(x) 2025-05-07T20:32:43.8639132Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.8639435Z x = x_sign * x_clamp 2025-05-07T20:32:43.8639683Z x0 = x[:, :D] 2025-05-07T20:32:43.8639906Z x1 = x[:, D:] 2025-05-07T20:32:43.8640109Z 2025-05-07T20:32:43.8640302Z if contiguous: 2025-05-07T20:32:43.8640537Z x0 = x0.contiguous() 2025-05-07T20:32:43.8640796Z x1 = x1.contiguous() 2025-05-07T20:32:43.8641046Z 2025-05-07T20:32:43.8641250Z if scale_ub is not None: 2025-05-07T20:32:43.8641531Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.8641861Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.8642183Z ) 2025-05-07T20:32:43.8642375Z else: 2025-05-07T20:32:43.8642591Z scale_ub_tensor = None 2025-05-07T20:32:43.8642846Z 2025-05-07T20:32:43.8643079Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.8643393Z op = silu_mul_quant 2025-05-07T20:32:43.8643650Z if compiled: 2025-05-07T20:32:43.8643900Z op = torch.compile(op) 2025-05-07T20:32:43.8644197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8644471Z 2025-05-07T20:32:43.8644667Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.8644829Z 2025-05-07T20:32:43.8644930Z moe/activation_test.py:117: 2025-05-07T20:32:43.8645227Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8645566Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.8645849Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8646536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.8647228Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.8647759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.8648434Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.8649099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.8649635Z kernel = self.compile( 2025-05-07T20:32:43.8650175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.8650911Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.8651317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8651547Z 2025-05-07T20:32:43.8651758Z self = 2025-05-07T20:32:43.8652920Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.8654289Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b153060>} 2025-05-07T20:32:43.8655629Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.8656662Z context = 2025-05-07T20:32:43.8656950Z 2025-05-07T20:32:43.8657119Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.8657640Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.8658115Z module_map=module_map) 2025-05-07T20:32:43.8658482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.8658849Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.8659130Z E ^ 2025-05-07T20:32:43.8659617Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.8660065Z 2025-05-07T20:32:43.8660485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.8660993Z 2025-05-07T20:32:43.8661104Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.8661520Z self=, 2025-05-07T20:32:43.8661926Z T=2048, 2025-05-07T20:32:43.8662115Z D=7168, 2025-05-07T20:32:43.8662306Z scale_ub=None, 2025-05-07T20:32:43.8662530Z contiguous=False, 2025-05-07T20:32:43.8662757Z compiled=False, 2025-05-07T20:32:43.8662960Z ) 2025-05-07T20:32:43.8663276Z self = 2025-05-07T20:32:43.8663778Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.8664050Z 2025-05-07T20:32:43.8664130Z @given( 2025-05-07T20:32:43.8664361Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8664677Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8664986Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8665313Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8665820Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8666109Z ) 2025-05-07T20:32:43.8666451Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8666889Z def test_silu_mul_quant( 2025-05-07T20:32:43.8667136Z self, 2025-05-07T20:32:43.8667327Z T: int, 2025-05-07T20:32:43.8667529Z D: int, 2025-05-07T20:32:43.8667753Z scale_ub: Optional[float], 2025-05-07T20:32:43.8668024Z contiguous: bool, 2025-05-07T20:32:43.8668264Z compiled: bool, 2025-05-07T20:32:43.8668491Z ) -> None: 2025-05-07T20:32:43.8668705Z torch.manual_seed(2025) 2025-05-07T20:32:43.8668949Z 2025-05-07T20:32:43.8669266Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8671452Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
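This CompilationError is the run's second, distinct failure mode. Triton's fp8e4nv is the FP8 e4m3 encoding, which Triton lowers only on NVIDIA GPUs with compute capability 8.9 or newer; the A10G on a linux.g5.4xlarge runner reports (8, 6), where only fp8e4b15 and fp8e5 are available, exactly as the ValueError states. A sketch of a capability guard; the skip placement is an assumption, not the repository's actual handling:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # FP8 e4m3 ("fp8e4nv" in Triton) generally needs compute capability
        # >= (8, 9), i.e. Ada or Hopper; this runner's A10G reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _supports_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU")
    class ActivationTests(unittest.TestCase):
        ...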
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.8673411Z 2025-05-07T20:32:43.8673533Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.8673742Z 2025-05-07T20:32:43.8673848Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.8674259Z self=, 2025-05-07T20:32:43.8674662Z T=128, 2025-05-07T20:32:43.8674846Z D=7168, 2025-05-07T20:32:43.8675038Z scale_ub=1200.0, 2025-05-07T20:32:43.8675261Z contiguous=True, 2025-05-07T20:32:43.8675482Z compiled=True, 2025-05-07T20:32:43.8675683Z ) 2025-05-07T20:32:43.8987867Z self = 2025-05-07T20:32:43.8988393Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.8988690Z 2025-05-07T20:32:43.8988798Z @given( 2025-05-07T20:32:43.8989045Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8989425Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8989792Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8990184Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8990516Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8990806Z ) 2025-05-07T20:32:43.8991157Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8991607Z def test_silu_mul_quant( 2025-05-07T20:32:43.8991859Z self, 2025-05-07T20:32:43.8992063Z T: int, 2025-05-07T20:32:43.8992259Z D: int, 2025-05-07T20:32:43.8992490Z scale_ub: Optional[float], 2025-05-07T20:32:43.8992771Z contiguous: bool, 2025-05-07T20:32:43.8993021Z compiled: bool, 2025-05-07T20:32:43.8993253Z ) -> None: 2025-05-07T20:32:43.8993473Z torch.manual_seed(2025) 2025-05-07T20:32:43.8993716Z 2025-05-07T20:32:43.8993993Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8994349Z 2025-05-07T20:32:43.8994545Z x_sign = torch.sign(x) 2025-05-07T20:32:43.8994841Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.8995160Z x = x_sign * x_clamp 2025-05-07T20:32:43.8995399Z x0 = x[:, :D] 2025-05-07T20:32:43.8995630Z x1 = x[:, D:] 2025-05-07T20:32:43.8995901Z 2025-05-07T20:32:43.8996089Z if contiguous: 2025-05-07T20:32:43.8996329Z x0 = x0.contiguous() 2025-05-07T20:32:43.8996595Z x1 = x1.contiguous() 2025-05-07T20:32:43.8996837Z 2025-05-07T20:32:43.8997032Z if scale_ub is not None: 2025-05-07T20:32:43.8997308Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.8997653Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.8997967Z ) 2025-05-07T20:32:43.8998162Z else: 2025-05-07T20:32:43.8998380Z scale_ub_tensor = None 2025-05-07T20:32:43.8998640Z 2025-05-07T20:32:43.8998884Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.8999207Z op = silu_mul_quant 2025-05-07T20:32:43.8999506Z if compiled: 2025-05-07T20:32:43.8999758Z op = torch.compile(op) 2025-05-07T20:32:43.9000059Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.9000334Z 2025-05-07T20:32:43.9000531Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.9000721Z 2025-05-07T20:32:43.9000825Z moe/activation_test.py:117: 2025-05-07T20:32:43.9001130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.9001460Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.9001746Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.9002480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.9003050Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.9003816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.9004505Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.9005046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.9005725Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.9006395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.9006934Z kernel = self.compile( 2025-05-07T20:32:43.9007475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.9008134Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.9008537Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.9008773Z 2025-05-07T20:32:43.9008987Z self = 2025-05-07T20:32:43.9010076Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.9011452Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8afe0900>} 2025-05-07T20:32:43.9012805Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.9013837Z context = 2025-05-07T20:32:43.9014126Z 2025-05-07T20:32:43.9014298Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.9014832Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.9015309Z module_map=module_map) 2025-05-07T20:32:43.9015678Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.9016043Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.9016349Z E ^ 2025-05-07T20:32:43.9016894Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.9017350Z 2025-05-07T20:32:43.9017772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.9018283Z 2025-05-07T20:32:43.9018395Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9018817Z self=, 2025-05-07T20:32:43.9019224Z T=128, 2025-05-07T20:32:43.9019425Z D=7168, 2025-05-07T20:32:43.9019618Z scale_ub=1200.0, 2025-05-07T20:32:43.9019844Z contiguous=True, 2025-05-07T20:32:43.9020071Z compiled=False, 2025-05-07T20:32:43.9020276Z ) 2025-05-07T20:32:43.9020599Z self = 2025-05-07T20:32:43.9021099Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.9021372Z 2025-05-07T20:32:43.9021454Z @given( 2025-05-07T20:32:43.9021689Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9022005Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9022312Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9022648Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9023097Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9023389Z ) 2025-05-07T20:32:43.9023735Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9024263Z def test_silu_mul_quant( 2025-05-07T20:32:43.9024509Z self, 2025-05-07T20:32:43.9024705Z T: int, 2025-05-07T20:32:43.9024907Z D: int, 2025-05-07T20:32:43.9025130Z scale_ub: Optional[float], 2025-05-07T20:32:43.9025402Z contiguous: bool, 2025-05-07T20:32:43.9025643Z compiled: bool, 2025-05-07T20:32:43.9025868Z ) -> None: 2025-05-07T20:32:43.9026083Z torch.manual_seed(2025) 2025-05-07T20:32:43.9026328Z 2025-05-07T20:32:43.9026609Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9026949Z 2025-05-07T20:32:43.9027154Z x_sign = torch.sign(x) 2025-05-07T20:32:43.9027452Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.9029474Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
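Note how the headroom shrinks as the run proceeds: earlier examples reported 30.44 MiB free with 21.73 GiB allocated by PyTorch, while this one reports 8.44 MiB free with 21.77 GiB allocated. Tensors from previous Hypothesis examples are evidently still alive, so even a 20 MiB request fails. A hypothetical per-example cleanup (not the repository's actual fix):

    import gc

    import torch

    def _release_cuda_memory() -> None:
        gc.collect()               # drop dead Python references to CUDA tensors
        torch.cuda.synchronize()   # make sure pending kernels have finished
        torch.cuda.empty_cache()   # return cached, unused blocks to the driver

Called from the test's setUp()/tearDown(), this would give each generated example a clean slate instead of inheriting the previous examples' allocations.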
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9031338Z 2025-05-07T20:32:43.9031465Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.9031682Z 2025-05-07T20:32:43.9031788Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9032203Z self=, 2025-05-07T20:32:43.9032612Z T=128, 2025-05-07T20:32:43.9032797Z D=5120, 2025-05-07T20:32:43.9032991Z scale_ub=1200.0, 2025-05-07T20:32:43.9033221Z contiguous=True, 2025-05-07T20:32:43.9040574Z compiled=True, 2025-05-07T20:32:43.9040824Z ) 2025-05-07T20:32:43.9041153Z self = 2025-05-07T20:32:43.9041658Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.9041930Z 2025-05-07T20:32:43.9042013Z @given( 2025-05-07T20:32:43.9042247Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9042568Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9042877Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9043207Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9043539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9043827Z ) 2025-05-07T20:32:43.9044173Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9044616Z def test_silu_mul_quant( 2025-05-07T20:32:43.9044865Z self, 2025-05-07T20:32:43.9045062Z T: int, 2025-05-07T20:32:43.9045264Z D: int, 2025-05-07T20:32:43.9045486Z scale_ub: Optional[float], 2025-05-07T20:32:43.9045757Z contiguous: bool, 2025-05-07T20:32:43.9046003Z compiled: bool, 2025-05-07T20:32:43.9046232Z ) -> None: 2025-05-07T20:32:43.9046446Z torch.manual_seed(2025) 2025-05-07T20:32:43.9046694Z 2025-05-07T20:32:43.9046971Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9047320Z 2025-05-07T20:32:43.9047512Z > x_sign = torch.sign(x) 2025-05-07T20:32:43.9049574Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
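With only ~8 MiB free, even the elementwise preprocessing fails: torch.sign, torch.abs, torch.clamp, and the final multiply each materialize another tensor the size of x. A hypothetical in-place rewrite that produces the same values with a single temporary:

    import torch

    def clamp_preserving_sign_(x: torch.Tensor) -> torch.Tensor:
        # Equivalent to: torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0),
        # but mutates x and allocates only the sign tensor.
        sign = torch.sign(x)
        return x.abs_().clamp_(0.01, 2.0).mul_(sign)

This would not fix the leak across examples, but it lowers each example's peak footprint.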
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9051511Z 2025-05-07T20:32:43.9051631Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:43.9051843Z 2025-05-07T20:32:43.9051955Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9052366Z self=, 2025-05-07T20:32:43.9052773Z T=128, 2025-05-07T20:32:43.9052968Z D=7168, 2025-05-07T20:32:43.9053157Z scale_ub=None, 2025-05-07T20:32:43.9053371Z contiguous=True, 2025-05-07T20:32:43.9053595Z compiled=True, 2025-05-07T20:32:43.9053800Z ) 2025-05-07T20:32:44.2296320Z self = 2025-05-07T20:32:44.2296838Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.2297113Z 2025-05-07T20:32:44.2297202Z @given( 2025-05-07T20:32:44.2297433Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2297737Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2298046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2298378Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2298699Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2298987Z ) 2025-05-07T20:32:44.2299332Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2299769Z def test_silu_mul_quant( 2025-05-07T20:32:44.2300015Z self, 2025-05-07T20:32:44.2300216Z T: int, 2025-05-07T20:32:44.2300408Z D: int, 2025-05-07T20:32:44.2300628Z scale_ub: Optional[float], 2025-05-07T20:32:44.2300901Z contiguous: bool, 2025-05-07T20:32:44.2301143Z compiled: bool, 2025-05-07T20:32:44.2301364Z ) -> None: 2025-05-07T20:32:44.2301580Z torch.manual_seed(2025) 2025-05-07T20:32:44.2301828Z 2025-05-07T20:32:44.2302099Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2304154Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.2306016Z 2025-05-07T20:32:44.2306133Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.2306342Z 2025-05-07T20:32:44.2390608Z FAILED 2025-05-07T20:32:44.2390776Z 2025-05-07T20:32:44.2390914Z =================================== FAILURES =================================== 2025-05-07T20:32:44.2391375Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:44.2392014Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:44.2392853Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:44.2393616Z | yield 2025-05-07T20:32:44.2394205Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:32:44.2394931Z | self._callTestMethod(testMethod) 2025-05-07T20:32:44.2395792Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:32:44.2396558Z | if method() is not None: 2025-05-07T20:32:44.2396905Z | ^^^^^^^^ 2025-05-07T20:32:44.2398017Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:44.2399022Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2399461Z | ^^^^^^^ 2025-05-07T20:32:44.2400243Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:44.2401227Z | raise the_error_hypothesis_found 2025-05-07T20:32:44.2401803Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:44.2402387Z +-+---------------- 1 ---------------- 2025-05-07T20:32:44.2402785Z | Traceback (most recent call last): 2025-05-07T20:32:44.2403742Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:44.2404806Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2405337Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.2408048Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
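The fourth failure in this group (below) trips the same fp8e4nv limitation inside the test's reference path, triton_quantize_fp8_row. PyTorch's float8 dtype casts are software conversions and do not go through Triton, so a rough stand-in for row-wise FP8 quantization can run even on this GPU; the FP8_MAX choice and the scale_ub handling here are assumptions, not fbgemm_gpu's exact semantics:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row scale chosen so the row maximum maps to the e4m3 max (448.0).
        FP8_MAX = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale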
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.2410776Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.2411379Z | self=, 2025-05-07T20:32:44.2411942Z | T=128, 2025-05-07T20:32:44.2412227Z | D=7168, 2025-05-07T20:32:44.2412518Z | scale_ub=1200.0, 2025-05-07T20:32:44.2412840Z | contiguous=True, 2025-05-07T20:32:44.2413179Z | compiled=False, 2025-05-07T20:32:44.2413407Z | ) 2025-05-07T20:32:44.2413580Z | 2025-05-07T20:32:44.2414103Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:32:44.2414714Z +---------------- 2 ---------------- 2025-05-07T20:32:44.2415006Z | Traceback (most recent call last): 2025-05-07T20:32:44.2415701Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:44.2416467Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2416840Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.2418823Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.2420783Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.2421210Z | self=, 2025-05-07T20:32:44.2421614Z | T=128, 2025-05-07T20:32:44.2421814Z | D=7168, 2025-05-07T20:32:44.2422014Z | scale_ub=None, 2025-05-07T20:32:44.2422253Z | contiguous=True, 2025-05-07T20:32:44.2422491Z | compiled=True, 2025-05-07T20:32:44.2422707Z | ) 2025-05-07T20:32:44.2422884Z | 2025-05-07T20:32:44.2423503Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:44.2424106Z +---------------- 3 ---------------- 2025-05-07T20:32:44.2424390Z | Traceback (most recent call last): 2025-05-07T20:32:44.2425162Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:44.2425927Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2426293Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.2428266Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
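Each falsifying example is accompanied by a @reproduce_failure hint like the ones printed above. Following that instruction, the decorator goes directly above @given on the original test (blob copied verbatim from failure 1; it only decodes against these exact strategies):

    from hypothesis import Verbosity, given, reproduce_failure, settings
    import hypothesis.strategies as st

    @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=')  # temporary; remove after debugging
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...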
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.2431011Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.2431634Z | self=, 2025-05-07T20:32:44.2432067Z | T=128, 2025-05-07T20:32:44.2432315Z | D=5120, 2025-05-07T20:32:44.2432614Z | scale_ub=1200.0, 2025-05-07T20:32:44.2432957Z | contiguous=True, 2025-05-07T20:32:44.2433293Z | compiled=True, 2025-05-07T20:32:44.2433608Z | ) 2025-05-07T20:32:44.2433863Z | 2025-05-07T20:32:44.2434607Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:44.2435476Z +---------------- 4 ---------------- 2025-05-07T20:32:44.2435979Z | Traceback (most recent call last): 2025-05-07T20:32:44.2436972Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:44.2437931Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:44.2438330Z | ^^^^^^^^ 2025-05-07T20:32:44.2439231Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:44.2440177Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2440634Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.2441721Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:44.2442808Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.2443631Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:44.2444623Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2445247Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.2446140Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:44.2447206Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.2447834Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.2448715Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:44.2449650Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.2450287Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.2451101Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:44.2451958Z | fn() 2025-05-07T20:32:44.2452757Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:44.2453598Z | self.fn.run( 2025-05-07T20:32:44.2454314Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:44.2454893Z | kernel = self.compile( 2025-05-07T20:32:44.2455149Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:44.2455733Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:44.2456433Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2456812Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.2457556Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:44.2458582Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2459064Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.2459442Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2459788Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.2460049Z | ^ 2025-05-07T20:32:44.2460509Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2461073Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.2461474Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:44.2461987Z | self=, 2025-05-07T20:32:44.2462418Z | T=1, # or any other generated value 2025-05-07T20:32:44.2462730Z | D=5120, # or any other generated value 2025-05-07T20:32:44.2463064Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:44.2463429Z | contiguous=True, # or any other generated value 2025-05-07T20:32:44.2463789Z | compiled=True, # or any other generated value 2025-05-07T20:32:44.2464089Z | ) 2025-05-07T20:32:44.2464275Z | 2025-05-07T20:32:44.2464797Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:44.2465757Z +------------------------------------ 2025-05-07T20:32:44.2466123Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:44.2466497Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2466904Z self=, 2025-05-07T20:32:44.2467302Z T=1, 2025-05-07T20:32:44.2467493Z D=5120, 2025-05-07T20:32:44.2467681Z scale_ub=None, 2025-05-07T20:32:44.2467893Z contiguous=True, 2025-05-07T20:32:44.2468111Z compiled=True, 2025-05-07T20:32:44.2468309Z ) 2025-05-07T20:32:44.2468624Z self = 2025-05-07T20:32:44.2469108Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.2469414Z 2025-05-07T20:32:44.2469498Z @given( 2025-05-07T20:32:44.2469723Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2470033Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2470339Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2470666Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2471152Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2471550Z ) 2025-05-07T20:32:44.2472030Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2472948Z def test_silu_mul_quant( 2025-05-07T20:32:44.2473297Z self, 2025-05-07T20:32:44.2473568Z T: int, 2025-05-07T20:32:44.2473860Z D: int, 2025-05-07T20:32:44.2474176Z scale_ub: Optional[float], 2025-05-07T20:32:44.2474540Z contiguous: bool, 2025-05-07T20:32:44.2474778Z compiled: bool, 2025-05-07T20:32:44.2475002Z ) -> None: 2025-05-07T20:32:44.2475219Z torch.manual_seed(2025) 2025-05-07T20:32:44.2475457Z 2025-05-07T20:32:44.2475874Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2476358Z 2025-05-07T20:32:44.2476621Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2477020Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2477465Z x = x_sign * x_clamp 2025-05-07T20:32:44.2477791Z x0 = x[:, :D] 2025-05-07T20:32:44.2478090Z x1 = x[:, D:] 2025-05-07T20:32:44.2478371Z 2025-05-07T20:32:44.2478621Z if contiguous: 2025-05-07T20:32:44.2478953Z x0 = x0.contiguous() 2025-05-07T20:32:44.2479307Z x1 = x1.contiguous() 2025-05-07T20:32:44.2479648Z 2025-05-07T20:32:44.2479921Z if scale_ub is not None: 2025-05-07T20:32:44.2480308Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2480771Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2481206Z ) 2025-05-07T20:32:44.2481484Z else: 2025-05-07T20:32:44.2481781Z scale_ub_tensor = None 2025-05-07T20:32:44.2482135Z 2025-05-07T20:32:44.2482462Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2482895Z op = silu_mul_quant 2025-05-07T20:32:44.2483256Z if compiled: 2025-05-07T20:32:44.2483606Z op = torch.compile(op) 2025-05-07T20:32:44.2484021Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2484401Z 2025-05-07T20:32:44.2484676Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.2485084Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.2485469Z 2025-05-07T20:32:44.2485799Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2486266Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.2486671Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.2487110Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.2487605Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2488037Z 2025-05-07T20:32:44.2488330Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:44.2488605Z 2025-05-07T20:32:44.2488748Z moe/activation_test.py:126: 2025-05-07T20:32:44.2489167Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2489619Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.2490077Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2491172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.2492214Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.2492965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2493907Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2494863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.2495849Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.2496994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.2497866Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.2498697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.2499554Z fn() 2025-05-07T20:32:44.2500260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.2501060Z self.fn.run( 2025-05-07T20:32:44.2501708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2502447Z kernel = self.compile( 2025-05-07T20:32:44.2503188Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2504078Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2504634Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2504964Z 2025-05-07T20:32:44.2505245Z self = 2025-05-07T20:32:44.2506716Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2508625Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c04acc360>} 2025-05-07T20:32:44.2510491Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2511887Z context = 2025-05-07T20:32:44.2512271Z 2025-05-07T20:32:44.2512505Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2513226Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2513875Z module_map=module_map) 2025-05-07T20:32:44.2514378Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2514878Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.2515251Z E ^ 2025-05-07T20:32:44.2515991Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2516611Z 2025-05-07T20:32:44.2517174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2517868Z 2025-05-07T20:32:44.2518018Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2518551Z self=, 2025-05-07T20:32:44.2519123Z T=2048, 2025-05-07T20:32:44.2519424Z D=5120, 2025-05-07T20:32:44.2519686Z scale_ub=1200.0, 2025-05-07T20:32:44.2519988Z contiguous=True, 2025-05-07T20:32:44.2520282Z compiled=False, 2025-05-07T20:32:44.2520585Z ) 2025-05-07T20:32:44.2521019Z self = 2025-05-07T20:32:44.2521695Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.2522067Z 2025-05-07T20:32:44.2522184Z @given( 2025-05-07T20:32:44.2522500Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2522937Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2523375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2523833Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2524295Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2524695Z ) 2025-05-07T20:32:44.2525295Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2525909Z def test_silu_mul_quant( 2025-05-07T20:32:44.2526254Z self, 2025-05-07T20:32:44.2526550Z T: int, 2025-05-07T20:32:44.2526934Z D: int, 2025-05-07T20:32:44.2527251Z scale_ub: Optional[float], 2025-05-07T20:32:44.2527623Z contiguous: bool, 2025-05-07T20:32:44.2527965Z compiled: bool, 2025-05-07T20:32:44.2528290Z ) -> None: 2025-05-07T20:32:44.2528608Z torch.manual_seed(2025) 2025-05-07T20:32:44.2528952Z 2025-05-07T20:32:44.2529385Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2529863Z 2025-05-07T20:32:44.2530135Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2530548Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2530987Z x = x_sign * x_clamp 2025-05-07T20:32:44.2552266Z x0 = x[:, :D] 
2025-05-07T20:32:44.2552592Z x1 = x[:, D:] 2025-05-07T20:32:44.2552897Z 2025-05-07T20:32:44.2553189Z if contiguous: 2025-05-07T20:32:44.2553477Z x0 = x0.contiguous() 2025-05-07T20:32:44.2553804Z x1 = x1.contiguous() 2025-05-07T20:32:44.2554092Z 2025-05-07T20:32:44.2554335Z if scale_ub is not None: 2025-05-07T20:32:44.2554667Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2555075Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2555452Z ) 2025-05-07T20:32:44.2555838Z else: 2025-05-07T20:32:44.2556147Z scale_ub_tensor = None 2025-05-07T20:32:44.2556515Z 2025-05-07T20:32:44.2556856Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2557291Z op = silu_mul_quant 2025-05-07T20:32:44.2557636Z if compiled: 2025-05-07T20:32:44.2557986Z op = torch.compile(op) 2025-05-07T20:32:44.2558392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2558727Z 2025-05-07T20:32:44.2559000Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2559223Z 2025-05-07T20:32:44.2559365Z moe/activation_test.py:117: 2025-05-07T20:32:44.2559779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2560212Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2560619Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2561578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2562549Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2563293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2564206Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2565764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2566471Z kernel = self.compile( 2025-05-07T20:32:44.2567161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2567978Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2568529Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2568850Z 2025-05-07T20:32:44.2569149Z self = 2025-05-07T20:32:44.2570617Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2572482Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c054f1e40>} 2025-05-07T20:32:44.2574606Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2576124Z context = 2025-05-07T20:32:44.2578928Z 2025-05-07T20:32:44.2579168Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2579906Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2580544Z module_map=module_map) 2025-05-07T20:32:44.2581043Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2581528Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2581884Z E ^ 2025-05-07T20:32:44.2582523Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2583149Z 2025-05-07T20:32:44.2583725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2584414Z 2025-05-07T20:32:44.2584566Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2585114Z self=, 2025-05-07T20:32:44.2585639Z T=2048, 2025-05-07T20:32:44.2585900Z D=5120, 2025-05-07T20:32:44.2586163Z scale_ub=1200.0, 2025-05-07T20:32:44.2586473Z contiguous=True, 2025-05-07T20:32:44.2586779Z compiled=True, 2025-05-07T20:32:44.2587039Z ) 2025-05-07T20:32:44.2587468Z self = 2025-05-07T20:32:44.2588118Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.2588468Z 2025-05-07T20:32:44.2588582Z @given( 2025-05-07T20:32:44.2588889Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2589326Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2589749Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2590206Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2590663Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2591072Z ) 2025-05-07T20:32:44.2591547Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2592156Z def test_silu_mul_quant( 2025-05-07T20:32:44.2592496Z self, 2025-05-07T20:32:44.2592775Z T: int, 2025-05-07T20:32:44.2593050Z D: int, 2025-05-07T20:32:44.2593359Z scale_ub: Optional[float], 2025-05-07T20:32:44.2593738Z contiguous: bool, 2025-05-07T20:32:44.2594068Z compiled: bool, 2025-05-07T20:32:44.2594384Z ) -> None: 2025-05-07T20:32:44.2594691Z torch.manual_seed(2025) 2025-05-07T20:32:44.2595027Z 2025-05-07T20:32:44.2595413Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2596020Z 2025-05-07T20:32:44.2596298Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2596713Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2597142Z x = x_sign * x_clamp 2025-05-07T20:32:44.2597480Z x0 = x[:, :D] 2025-05-07T20:32:44.2597791Z x1 = x[:, D:] 2025-05-07T20:32:44.2598091Z 2025-05-07T20:32:44.2598351Z if contiguous: 2025-05-07T20:32:44.2598689Z x0 = x0.contiguous() 2025-05-07T20:32:44.2599061Z x1 = x1.contiguous() 2025-05-07T20:32:44.2599445Z 2025-05-07T20:32:44.2599713Z if scale_ub is not None: 2025-05-07T20:32:44.2600100Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2600570Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2600998Z ) 2025-05-07T20:32:44.2601280Z else: 2025-05-07T20:32:44.2601583Z scale_ub_tensor = None 2025-05-07T20:32:44.2601934Z 2025-05-07T20:32:44.2602431Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2602885Z op = silu_mul_quant 2025-05-07T20:32:44.2603231Z if compiled: 2025-05-07T20:32:44.2603584Z op = torch.compile(op) 2025-05-07T20:32:44.2604096Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2604464Z 2025-05-07T20:32:44.2604737Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.2605126Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.2605531Z 2025-05-07T20:32:44.2605865Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2606327Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.2606735Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.2607171Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.2607655Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2608085Z 2025-05-07T20:32:44.2608359Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:44.2608641Z 2025-05-07T20:32:44.2608778Z moe/activation_test.py:126: 2025-05-07T20:32:44.2609186Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2609639Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.2610081Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2611127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.2612144Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.2612887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2613829Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2614759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.2615746Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.2616701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.2617561Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.2618372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.2619052Z fn() 2025-05-07T20:32:44.2619722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.2620492Z self.fn.run( 2025-05-07T20:32:44.2621114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2621778Z kernel = self.compile( 2025-05-07T20:32:44.2622458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2623336Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2623865Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2624166Z 2025-05-07T20:32:44.2624441Z self = 2025-05-07T20:32:44.2625799Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2627609Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c0535ac00>} 2025-05-07T20:32:44.2629462Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2630872Z context = 2025-05-07T20:32:44.2631252Z 2025-05-07T20:32:44.2631477Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2632226Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2632805Z module_map=module_map) 2025-05-07T20:32:44.2633250Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2633690Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.2634016Z E ^ 2025-05-07T20:32:44.2634586Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2635146Z 2025-05-07T20:32:44.2635666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2636500Z 2025-05-07T20:32:44.2636655Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2637225Z self=, 2025-05-07T20:32:44.2637803Z T=16384, 2025-05-07T20:32:44.2638082Z D=7168, 2025-05-07T20:32:44.2638356Z scale_ub=1200.0, 2025-05-07T20:32:44.2638663Z contiguous=False, 2025-05-07T20:32:44.2638980Z compiled=False, 2025-05-07T20:32:44.2639253Z ) 2025-05-07T20:32:44.2639697Z self = 2025-05-07T20:32:44.2640369Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.2640749Z 2025-05-07T20:32:44.2640857Z @given( 2025-05-07T20:32:44.2641174Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2641605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2642023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2642472Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2642920Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2643317Z ) 2025-05-07T20:32:44.2643800Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2644424Z def test_silu_mul_quant( 2025-05-07T20:32:44.2644768Z self, 2025-05-07T20:32:44.2645041Z T: int, 2025-05-07T20:32:44.2645325Z D: int, 2025-05-07T20:32:44.2645635Z scale_ub: Optional[float], 2025-05-07T20:32:44.2646011Z contiguous: bool, 2025-05-07T20:32:44.2646349Z compiled: bool, 2025-05-07T20:32:44.2646669Z ) -> None: 2025-05-07T20:32:44.2646965Z torch.manual_seed(2025) 2025-05-07T20:32:44.2647309Z 2025-05-07T20:32:44.2647689Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2648170Z 2025-05-07T20:32:44.2648419Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2648804Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2649230Z x = x_sign * x_clamp 2025-05-07T20:32:44.2649555Z x0 = x[:, :D] 2025-05-07T20:32:44.2649853Z x1 = x[:, D:] 2025-05-07T20:32:44.2650139Z 2025-05-07T20:32:44.2650390Z if contiguous: 2025-05-07T20:32:44.2650710Z x0 = x0.contiguous() 2025-05-07T20:32:44.2651064Z x1 = x1.contiguous() 2025-05-07T20:32:44.2651385Z 2025-05-07T20:32:44.2651653Z if scale_ub is not None: 2025-05-07T20:32:44.2652027Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2652466Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2652895Z ) 2025-05-07T20:32:44.2653171Z else: 2025-05-07T20:32:44.2653465Z scale_ub_tensor = None 2025-05-07T20:32:44.2653827Z 2025-05-07T20:32:44.2654147Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2654578Z op = silu_mul_quant 2025-05-07T20:32:44.2654918Z if compiled: 2025-05-07T20:32:44.2655383Z op = torch.compile(op) 2025-05-07T20:32:44.2655779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2656137Z 2025-05-07T20:32:44.2656403Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2656757Z 2025-05-07T20:32:44.2656900Z moe/activation_test.py:117: 2025-05-07T20:32:44.2657297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2657749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2658131Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2659040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.2659962Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2660683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2661599Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2662480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2663203Z kernel = self.compile( 2025-05-07T20:32:44.2663944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2664827Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2665643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2665973Z 2025-05-07T20:32:44.2666247Z self = 2025-05-07T20:32:44.2667704Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2669565Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c05583ce0>} 2025-05-07T20:32:44.2671406Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2672832Z context = 2025-05-07T20:32:44.2673229Z 2025-05-07T20:32:44.2673457Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2674180Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2674801Z module_map=module_map) 2025-05-07T20:32:44.2675291Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2675863Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2676221Z E ^ 2025-05-07T20:32:44.2676861Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2677480Z 2025-05-07T20:32:44.2678033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2678720Z 2025-05-07T20:32:44.2678871Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2679420Z self=, 2025-05-07T20:32:44.2679974Z T=1, 2025-05-07T20:32:44.2680234Z D=7168, 2025-05-07T20:32:44.2680502Z scale_ub=None, 2025-05-07T20:32:44.2680805Z contiguous=True, 2025-05-07T20:32:44.2681118Z compiled=True, 2025-05-07T20:32:44.2681404Z ) 2025-05-07T20:32:44.2681854Z self = 2025-05-07T20:32:44.2682520Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.2682871Z 2025-05-07T20:32:44.2683245Z @given( 2025-05-07T20:32:44.2683571Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2684014Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2684599Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2685041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2685480Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2685866Z ) 2025-05-07T20:32:44.2686325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2686922Z def test_silu_mul_quant( 2025-05-07T20:32:44.2687277Z self, 2025-05-07T20:32:44.2687570Z T: int, 2025-05-07T20:32:44.2687850Z D: int, 2025-05-07T20:32:44.2688171Z scale_ub: Optional[float], 2025-05-07T20:32:44.2688513Z contiguous: bool, 2025-05-07T20:32:44.2688828Z compiled: bool, 2025-05-07T20:32:44.2689140Z ) -> None: 2025-05-07T20:32:44.2689455Z torch.manual_seed(2025) 2025-05-07T20:32:44.2689804Z 2025-05-07T20:32:44.2690186Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2690665Z 2025-05-07T20:32:44.2690943Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2691347Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2691784Z x = x_sign * x_clamp 2025-05-07T20:32:44.2692118Z x0 = x[:, :D] 2025-05-07T20:32:44.2692427Z x1 = x[:, D:] 2025-05-07T20:32:44.2692733Z 2025-05-07T20:32:44.2692993Z if contiguous: 2025-05-07T20:32:44.2693314Z x0 = x0.contiguous() 2025-05-07T20:32:44.2693677Z x1 = x1.contiguous() 2025-05-07T20:32:44.2694022Z 2025-05-07T20:32:44.2694293Z if scale_ub is not None: 2025-05-07T20:32:44.2694668Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2695125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2695551Z ) 2025-05-07T20:32:44.2695834Z else: 2025-05-07T20:32:44.2696140Z scale_ub_tensor = None 2025-05-07T20:32:44.2696497Z 2025-05-07T20:32:44.2696825Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2697275Z op = silu_mul_quant 2025-05-07T20:32:44.2697623Z if compiled: 2025-05-07T20:32:44.2697976Z op = torch.compile(op) 2025-05-07T20:32:44.2698394Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2698779Z 2025-05-07T20:32:44.2699061Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.2699461Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.2699854Z 2025-05-07T20:32:44.2700184Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2700644Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.2701046Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.2701468Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.2701961Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2702394Z 2025-05-07T20:32:44.2702680Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:44.2702963Z 2025-05-07T20:32:44.2703109Z moe/activation_test.py:126: 2025-05-07T20:32:44.2703532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2704018Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.2704488Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2705580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.2706599Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.2707323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2708371Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2709320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.2710311Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.2711430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.2712317Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.2713152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.2713862Z fn() 2025-05-07T20:32:44.2714569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.2715386Z self.fn.run( 2025-05-07T20:32:44.2716175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2716937Z kernel = self.compile( 2025-05-07T20:32:44.2717687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2718425Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2718823Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2719076Z 2025-05-07T20:32:44.2719318Z self = 2025-05-07T20:32:44.2720408Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2721788Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c04fac720>} 2025-05-07T20:32:44.2723134Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2724158Z context = 2025-05-07T20:32:44.2724454Z 2025-05-07T20:32:44.2724622Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2725145Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2725609Z module_map=module_map) 2025-05-07T20:32:44.2725975Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2726336Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.2726605Z E ^ 2025-05-07T20:32:44.2727064Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2727525Z 2025-05-07T20:32:44.2727936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2728443Z 2025-05-07T20:32:44.2728559Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2728974Z self=, 2025-05-07T20:32:44.2729375Z T=4096, 2025-05-07T20:32:44.2729573Z D=5120, 2025-05-07T20:32:44.2729772Z scale_ub=None, 2025-05-07T20:32:44.2729986Z contiguous=False, 2025-05-07T20:32:44.2730215Z compiled=False, 2025-05-07T20:32:44.2730425Z ) 2025-05-07T20:32:44.2730743Z self = 2025-05-07T20:32:44.2731238Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.2731514Z 2025-05-07T20:32:44.2731604Z @given( 2025-05-07T20:32:44.2731835Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2732266Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2732584Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2732922Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2733328Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2733619Z ) 2025-05-07T20:32:44.2733968Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2734406Z def test_silu_mul_quant( 2025-05-07T20:32:44.2734656Z self, 2025-05-07T20:32:44.2734863Z T: int, 2025-05-07T20:32:44.2735059Z D: int, 2025-05-07T20:32:44.2735283Z scale_ub: Optional[float], 2025-05-07T20:32:44.2735559Z contiguous: bool, 2025-05-07T20:32:44.2735800Z compiled: bool, 2025-05-07T20:32:44.2736031Z ) -> None: 2025-05-07T20:32:44.2736251Z torch.manual_seed(2025) 2025-05-07T20:32:44.2736494Z 2025-05-07T20:32:44.2736778Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2737133Z 2025-05-07T20:32:44.2737332Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2737634Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2737968Z x = x_sign * x_clamp 2025-05-07T20:32:44.2738217Z x0 = x[:, :D] 2025-05-07T20:32:44.2738437Z x1 = x[:, D:] 2025-05-07T20:32:44.2738655Z 2025-05-07T20:32:44.2738851Z if contiguous: 2025-05-07T20:32:44.2739088Z x0 = x0.contiguous() 2025-05-07T20:32:44.2739356Z x1 = x1.contiguous() 2025-05-07T20:32:44.2739603Z 2025-05-07T20:32:44.2739799Z if scale_ub is not None: 2025-05-07T20:32:44.2740079Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2740422Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2740735Z ) 2025-05-07T20:32:44.2740936Z else: 2025-05-07T20:32:44.2741160Z scale_ub_tensor = None 2025-05-07T20:32:44.2741413Z 2025-05-07T20:32:44.2741657Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2741981Z op = silu_mul_quant 2025-05-07T20:32:44.2742234Z if compiled: 2025-05-07T20:32:44.2742492Z op = torch.compile(op) 2025-05-07T20:32:44.2742801Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2743083Z 2025-05-07T20:32:44.2743281Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2743454Z 2025-05-07T20:32:44.2743558Z moe/activation_test.py:117: 2025-05-07T20:32:44.2743861Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2752550Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2752882Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2753582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2754272Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2754828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2755517Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2756361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2756893Z kernel = self.compile( 2025-05-07T20:32:44.2757441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2757617Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2757753Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2757759Z 2025-05-07T20:32:44.2757965Z self = 2025-05-07T20:32:44.2758912Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2759472Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea6c49a0>} 2025-05-07T20:32:44.2760291Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2760492Z context = 2025-05-07T20:32:44.2760497Z 2025-05-07T20:32:44.2760664Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2760936Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2761050Z module_map=module_map) 2025-05-07T20:32:44.2761221Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2761327Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2761404Z E ^ 2025-05-07T20:32:44.2761769Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2761779Z 2025-05-07T20:32:44.2762192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2762197Z 2025-05-07T20:32:44.2762308Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2762534Z self=, 2025-05-07T20:32:44.2762615Z T=4096, 2025-05-07T20:32:44.2762703Z D=7168, 2025-05-07T20:32:44.2762789Z scale_ub=None, 2025-05-07T20:32:44.2762879Z contiguous=False, 2025-05-07T20:32:44.2762979Z compiled=False, 2025-05-07T20:32:44.2763058Z ) 2025-05-07T20:32:44.2763282Z self = 2025-05-07T20:32:44.2763472Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.2763477Z 2025-05-07T20:32:44.2763566Z @given( 2025-05-07T20:32:44.2763696Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2763799Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2763917Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2764048Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2764168Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2764249Z ) 2025-05-07T20:32:44.2764501Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2764597Z def test_silu_mul_quant( 2025-05-07T20:32:44.2764688Z self, 2025-05-07T20:32:44.2764769Z T: int, 2025-05-07T20:32:44.2764850Z D: int, 2025-05-07T20:32:44.2764959Z scale_ub: Optional[float], 2025-05-07T20:32:44.2765057Z contiguous: bool, 2025-05-07T20:32:44.2765146Z compiled: bool, 2025-05-07T20:32:44.2765236Z ) -> None: 2025-05-07T20:32:44.2765334Z torch.manual_seed(2025) 2025-05-07T20:32:44.2765758Z 2025-05-07T20:32:44.2765981Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2766059Z 2025-05-07T20:32:44.2766155Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2766305Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2766433Z x = x_sign * x_clamp 2025-05-07T20:32:44.2766542Z x0 = x[:, :D] 2025-05-07T20:32:44.2766635Z x1 = x[:, D:] 2025-05-07T20:32:44.2766712Z 2025-05-07T20:32:44.2766806Z if contiguous: 2025-05-07T20:32:44.2766902Z x0 = x0.contiguous() 2025-05-07T20:32:44.2766996Z x1 = x1.contiguous() 2025-05-07T20:32:44.2767078Z 2025-05-07T20:32:44.2767172Z if scale_ub is not None: 2025-05-07T20:32:44.2767500Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2767648Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2767726Z ) 2025-05-07T20:32:44.2767924Z else: 2025-05-07T20:32:44.2768030Z scale_ub_tensor = None 2025-05-07T20:32:44.2768105Z 2025-05-07T20:32:44.2768236Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2768336Z op = silu_mul_quant 2025-05-07T20:32:44.2768423Z if compiled: 2025-05-07T20:32:44.2768533Z op = torch.compile(op) 2025-05-07T20:32:44.2768639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2768716Z 2025-05-07T20:32:44.2768818Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2768823Z 2025-05-07T20:32:44.2768922Z moe/activation_test.py:117: 2025-05-07T20:32:44.2769055Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2769182Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2769301Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2769825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2769942Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2770300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2770532Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2770872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2770969Z kernel = self.compile( 2025-05-07T20:32:44.2771359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2771537Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2771678Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2771683Z 2025-05-07T20:32:44.2771888Z self = 2025-05-07T20:32:44.2772672Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2773181Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea6c4e00>} 2025-05-07T20:32:44.2773924Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2774124Z context = 2025-05-07T20:32:44.2774135Z 2025-05-07T20:32:44.2774300Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2774564Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2774683Z module_map=module_map) 2025-05-07T20:32:44.2774847Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2774955Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2775036Z E ^ 2025-05-07T20:32:44.2775391Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2775396Z 2025-05-07T20:32:44.2775813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2775818Z 2025-05-07T20:32:44.2775923Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2776236Z self=, 2025-05-07T20:32:44.2776318Z T=128, 2025-05-07T20:32:44.2776397Z D=7168, 2025-05-07T20:32:44.2776488Z scale_ub=None, 2025-05-07T20:32:44.2776580Z contiguous=False, 2025-05-07T20:32:44.2776742Z compiled=True, 2025-05-07T20:32:44.2776824Z ) 2025-05-07T20:32:44.2777045Z self = 2025-05-07T20:32:44.2777217Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.2777222Z 2025-05-07T20:32:44.2777309Z @given( 2025-05-07T20:32:44.2777431Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2777537Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2777666Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2777785Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2777906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2777984Z ) 2025-05-07T20:32:44.2778233Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2778338Z def test_silu_mul_quant( 2025-05-07T20:32:44.2778417Z self, 2025-05-07T20:32:44.2778505Z T: int, 2025-05-07T20:32:44.2778594Z D: int, 2025-05-07T20:32:44.2778698Z scale_ub: Optional[float], 2025-05-07T20:32:44.2778791Z contiguous: bool, 2025-05-07T20:32:44.2778890Z compiled: bool, 2025-05-07T20:32:44.2778972Z ) -> None: 2025-05-07T20:32:44.2779081Z torch.manual_seed(2025) 2025-05-07T20:32:44.2779175Z 2025-05-07T20:32:44.2779371Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2779455Z 2025-05-07T20:32:44.2779549Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2779676Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2779773Z x = x_sign * x_clamp 2025-05-07T20:32:44.2779856Z x0 = x[:, :D] 2025-05-07T20:32:44.2779945Z x1 = x[:, D:] 2025-05-07T20:32:44.2780027Z 2025-05-07T20:32:44.2780113Z if contiguous: 2025-05-07T20:32:44.2780207Z x0 = x0.contiguous() 2025-05-07T20:32:44.2780307Z x1 = x1.contiguous() 2025-05-07T20:32:44.2780387Z 2025-05-07T20:32:44.2780482Z if scale_ub is not None: 2025-05-07T20:32:44.2780598Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2780734Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2780821Z ) 2025-05-07T20:32:44.2780900Z else: 2025-05-07T20:32:44.2780997Z scale_ub_tensor = None 2025-05-07T20:32:44.2781083Z 2025-05-07T20:32:44.2781215Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2781309Z op = silu_mul_quant 2025-05-07T20:32:44.2781404Z if compiled: 2025-05-07T20:32:44.2781505Z op = torch.compile(op) 2025-05-07T20:32:44.2781614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2781701Z 2025-05-07T20:32:44.2781796Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.2781918Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.2782001Z 2025-05-07T20:32:44.2782145Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2782259Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.2782361Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.2782486Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.2782633Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2782710Z 2025-05-07T20:32:44.2782815Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:44.2782819Z 2025-05-07T20:32:44.2782927Z moe/activation_test.py:126: 2025-05-07T20:32:44.2783059Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2783175Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.2783400Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2783963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.2784867Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.2785229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2785451Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2785828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.2786085Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.2786467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.2786640Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.2786982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.2787072Z fn() 2025-05-07T20:32:44.2787476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.2787562Z self.fn.run( 2025-05-07T20:32:44.2787907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2788003Z kernel = self.compile( 2025-05-07T20:32:44.2788395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2788569Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2788700Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2788704Z 2025-05-07T20:32:44.2788925Z self = 2025-05-07T20:32:44.2789755Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2790271Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea6c5a80>} 2025-05-07T20:32:44.2791013Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2791206Z context = 2025-05-07T20:32:44.2791221Z 2025-05-07T20:32:44.2791387Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2791656Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2791776Z module_map=module_map) 2025-05-07T20:32:44.2791938Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2792046Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.2792132Z E ^ 2025-05-07T20:32:44.2792490Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2792494Z 2025-05-07T20:32:44.2792911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2792915Z 2025-05-07T20:32:44.2793021Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2793247Z self=, 2025-05-07T20:32:44.2793333Z T=128, 2025-05-07T20:32:44.2793414Z D=7168, 2025-05-07T20:32:44.2793625Z scale_ub=None, 2025-05-07T20:32:44.2793727Z contiguous=False, 2025-05-07T20:32:44.2793815Z compiled=False, 2025-05-07T20:32:44.2793894Z ) 2025-05-07T20:32:44.2794122Z self = 2025-05-07T20:32:44.2794373Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.2794378Z 2025-05-07T20:32:44.2794469Z @given( 2025-05-07T20:32:44.2794591Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2794692Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2794818Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2794937Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2795053Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2795138Z ) 2025-05-07T20:32:44.2795387Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2795490Z def test_silu_mul_quant( 2025-05-07T20:32:44.2795578Z self, 2025-05-07T20:32:44.2795659Z T: int, 2025-05-07T20:32:44.2795810Z D: int, 2025-05-07T20:32:44.2795913Z scale_ub: Optional[float], 2025-05-07T20:32:44.2796010Z contiguous: bool, 2025-05-07T20:32:44.2796107Z compiled: bool, 2025-05-07T20:32:44.2796188Z ) -> None: 2025-05-07T20:32:44.2796287Z torch.manual_seed(2025) 2025-05-07T20:32:44.2796373Z 2025-05-07T20:32:44.2796544Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2796623Z 2025-05-07T20:32:44.2796725Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2796851Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2796944Z x = x_sign * x_clamp 2025-05-07T20:32:44.2797035Z x0 = x[:, :D] 2025-05-07T20:32:44.2797117Z x1 = x[:, D:] 2025-05-07T20:32:44.2797202Z 2025-05-07T20:32:44.2797288Z if contiguous: 2025-05-07T20:32:44.2797383Z x0 = x0.contiguous() 2025-05-07T20:32:44.2797485Z x1 = x1.contiguous() 2025-05-07T20:32:44.2797562Z 2025-05-07T20:32:44.2797656Z if scale_ub is not None: 2025-05-07T20:32:44.2797772Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2797913Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2797993Z ) 2025-05-07T20:32:44.2798080Z else: 2025-05-07T20:32:44.2798178Z scale_ub_tensor = None 2025-05-07T20:32:44.2798254Z 2025-05-07T20:32:44.2798392Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2798483Z op = silu_mul_quant 2025-05-07T20:32:44.2798580Z if compiled: 2025-05-07T20:32:44.2798683Z op = torch.compile(op) 2025-05-07T20:32:44.2798794Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2798876Z 2025-05-07T20:32:44.2798978Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2798982Z 2025-05-07T20:32:44.2799085Z moe/activation_test.py:117: 2025-05-07T20:32:44.2799217Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2799326Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2799434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2799933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2800038Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2800396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2800624Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2800961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2801060Z kernel = self.compile( 2025-05-07T20:32:44.2801534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2801712Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2801847Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2801928Z 2025-05-07T20:32:44.2802134Z self = 2025-05-07T20:32:44.2802912Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2803421Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea19c540>} 2025-05-07T20:32:44.2804171Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2804369Z context = 2025-05-07T20:32:44.2804381Z 2025-05-07T20:32:44.2804547Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2804811Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2804925Z module_map=module_map) 2025-05-07T20:32:44.2805088Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2805195Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2805274Z E ^ 2025-05-07T20:32:44.2805629Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2805634Z 2025-05-07T20:32:44.2806055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2806060Z 2025-05-07T20:32:44.2806166Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2806396Z self=, 2025-05-07T20:32:44.2806481Z T=4096, 2025-05-07T20:32:44.2806560Z D=5120, 2025-05-07T20:32:44.2806650Z scale_ub=1200.0, 2025-05-07T20:32:44.2806735Z contiguous=True, 2025-05-07T20:32:44.2806820Z compiled=False, 2025-05-07T20:32:44.2806901Z ) 2025-05-07T20:32:44.2807119Z self = 2025-05-07T20:32:44.2807297Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.2807301Z 2025-05-07T20:32:44.2807385Z @given( 2025-05-07T20:32:44.2807506Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2807605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2807727Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2807850Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2807969Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2808045Z ) 2025-05-07T20:32:44.2808292Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2808398Z def test_silu_mul_quant( 2025-05-07T20:32:44.2808476Z self, 2025-05-07T20:32:44.2808554Z T: int, 2025-05-07T20:32:44.2808639Z D: int, 2025-05-07T20:32:44.2808739Z scale_ub: Optional[float], 2025-05-07T20:32:44.2808830Z contiguous: bool, 2025-05-07T20:32:44.2808923Z compiled: bool, 2025-05-07T20:32:44.2809002Z ) -> None: 2025-05-07T20:32:44.2809111Z torch.manual_seed(2025) 2025-05-07T20:32:44.2809203Z 2025-05-07T20:32:44.2809396Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2809479Z 2025-05-07T20:32:44.2809571Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2809779Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2809877Z x = x_sign * x_clamp 2025-05-07T20:32:44.2809960Z x0 = x[:, :D] 2025-05-07T20:32:44.2810042Z x1 = x[:, D:] 2025-05-07T20:32:44.2810236Z 2025-05-07T20:32:44.2810325Z if contiguous: 2025-05-07T20:32:44.2810417Z x0 = x0.contiguous() 2025-05-07T20:32:44.2810513Z x1 = x1.contiguous() 2025-05-07T20:32:44.2810587Z 2025-05-07T20:32:44.2810685Z if scale_ub is not None: 2025-05-07T20:32:44.2810798Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2810933Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2811017Z ) 2025-05-07T20:32:44.2811098Z else: 2025-05-07T20:32:44.2811198Z scale_ub_tensor = None 2025-05-07T20:32:44.2811278Z 2025-05-07T20:32:44.2811409Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2811502Z op = silu_mul_quant 2025-05-07T20:32:44.2811598Z if compiled: 2025-05-07T20:32:44.2811698Z op = torch.compile(op) 2025-05-07T20:32:44.2811805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2811890Z 2025-05-07T20:32:44.2811991Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2811996Z 2025-05-07T20:32:44.2812101Z moe/activation_test.py:117: 2025-05-07T20:32:44.2812232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2812334Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2812442Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2812940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2813038Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2813403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2813630Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2813974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2814078Z kernel = self.compile( 2025-05-07T20:32:44.2814458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2814641Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2814772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2814776Z 2025-05-07T20:32:44.2814986Z self = 2025-05-07T20:32:44.2815769Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2816277Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea19e480>} 2025-05-07T20:32:44.2817032Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2817225Z context = 2025-05-07T20:32:44.2817230Z 2025-05-07T20:32:44.2817402Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2817667Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2817775Z module_map=module_map) 2025-05-07T20:32:44.2817950Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2818054Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2818215Z E ^ 2025-05-07T20:32:44.2818578Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2818681Z 2025-05-07T20:32:44.2819093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2819098Z 2025-05-07T20:32:44.2819213Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2819439Z self=, 2025-05-07T20:32:44.2819518Z T=1, 2025-05-07T20:32:44.2819603Z D=5120, 2025-05-07T20:32:44.2819687Z scale_ub=None, 2025-05-07T20:32:44.2819775Z contiguous=True, 2025-05-07T20:32:44.2819866Z compiled=True, 2025-05-07T20:32:44.2819943Z ) 2025-05-07T20:32:44.2820168Z self = 2025-05-07T20:32:44.2820341Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.2820346Z 2025-05-07T20:32:44.2820425Z @given( 2025-05-07T20:32:44.2820550Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2820651Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2820772Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2820897Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2821011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2821087Z ) 2025-05-07T20:32:44.2821337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2821432Z def test_silu_mul_quant( 2025-05-07T20:32:44.2821518Z self, 2025-05-07T20:32:44.2821596Z T: int, 2025-05-07T20:32:44.2821675Z D: int, 2025-05-07T20:32:44.2821780Z scale_ub: Optional[float], 2025-05-07T20:32:44.2821876Z contiguous: bool, 2025-05-07T20:32:44.2821967Z compiled: bool, 2025-05-07T20:32:44.2822053Z ) -> None: 2025-05-07T20:32:44.2822154Z torch.manual_seed(2025) 2025-05-07T20:32:44.2822229Z 2025-05-07T20:32:44.2822408Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2822490Z 2025-05-07T20:32:44.2822584Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2822715Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2822806Z x = x_sign * x_clamp 2025-05-07T20:32:44.2822894Z x0 = x[:, :D] 2025-05-07T20:32:44.2822975Z x1 = x[:, D:] 2025-05-07T20:32:44.2823049Z 2025-05-07T20:32:44.2823139Z if contiguous: 2025-05-07T20:32:44.2823233Z x0 = x0.contiguous() 2025-05-07T20:32:44.2823323Z x1 = x1.contiguous() 2025-05-07T20:32:44.2823404Z 2025-05-07T20:32:44.2823496Z if scale_ub is not None: 2025-05-07T20:32:44.2823607Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2823746Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2823829Z ) 2025-05-07T20:32:44.2823908Z else: 2025-05-07T20:32:44.2824008Z scale_ub_tensor = None 2025-05-07T20:32:44.2824086Z 2025-05-07T20:32:44.2824215Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2824318Z op = silu_mul_quant 2025-05-07T20:32:44.2824402Z if compiled: 2025-05-07T20:32:44.2824509Z op = torch.compile(op) 2025-05-07T20:32:44.2824616Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2824690Z 2025-05-07T20:32:44.2824792Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.2824915Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.2824989Z 2025-05-07T20:32:44.2825133Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2825236Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.2825342Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.2825552Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.2825693Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2825775Z 2025-05-07T20:32:44.2825877Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:44.2825955Z 2025-05-07T20:32:44.2826056Z moe/activation_test.py:126: 2025-05-07T20:32:44.2826192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2826300Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.2826436Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2827000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.2827102Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.2827466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2827692Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2828061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.2828331Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.2828703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.2828870Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.2829216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.2829295Z fn() 2025-05-07T20:32:44.2829698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.2829783Z self.fn.run( 2025-05-07T20:32:44.2830123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2830226Z kernel = self.compile( 2025-05-07T20:32:44.2830605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2830792Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2830924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2830928Z 2025-05-07T20:32:44.2831135Z self = 2025-05-07T20:32:44.2831921Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2832429Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c05058c20>} 2025-05-07T20:32:44.2833179Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2833376Z context = 2025-05-07T20:32:44.2833381Z 2025-05-07T20:32:44.2833547Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2833817Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2833927Z module_map=module_map) 2025-05-07T20:32:44.2834097Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2834201Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.2834281Z E ^ 2025-05-07T20:32:44.2834726Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2834731Z 2025-05-07T20:32:44.2835146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2835227Z 2025-05-07T20:32:44.2835339Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2835564Z self=, 2025-05-07T20:32:44.2835643Z T=2048, 2025-05-07T20:32:44.2835782Z D=5120, 2025-05-07T20:32:44.2835867Z scale_ub=None, 2025-05-07T20:32:44.2835956Z contiguous=True, 2025-05-07T20:32:44.2836048Z compiled=True, 2025-05-07T20:32:44.2836123Z ) 2025-05-07T20:32:44.2836344Z self = 2025-05-07T20:32:44.2836522Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.2836526Z 2025-05-07T20:32:44.2836605Z @given( 2025-05-07T20:32:44.2836732Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2836838Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2836955Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2837078Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2837198Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2837273Z ) 2025-05-07T20:32:44.2837522Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2837615Z def test_silu_mul_quant( 2025-05-07T20:32:44.2837693Z self, 2025-05-07T20:32:44.2837777Z T: int, 2025-05-07T20:32:44.2837858Z D: int, 2025-05-07T20:32:44.2837959Z scale_ub: Optional[float], 2025-05-07T20:32:44.2838055Z contiguous: bool, 2025-05-07T20:32:44.2838142Z compiled: bool, 2025-05-07T20:32:44.2838227Z ) -> None: 2025-05-07T20:32:44.2838323Z torch.manual_seed(2025) 2025-05-07T20:32:44.2838397Z 2025-05-07T20:32:44.2838574Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2838653Z 2025-05-07T20:32:44.2838746Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2838877Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2838975Z x = x_sign * x_clamp 2025-05-07T20:32:44.2839060Z x0 = x[:, :D] 2025-05-07T20:32:44.2839148Z x1 = x[:, D:] 2025-05-07T20:32:44.2839222Z 2025-05-07T20:32:44.2839307Z if contiguous: 2025-05-07T20:32:44.2839409Z x0 = x0.contiguous() 2025-05-07T20:32:44.2839500Z x1 = x1.contiguous() 2025-05-07T20:32:44.2839574Z 2025-05-07T20:32:44.2839675Z if scale_ub is not None: 2025-05-07T20:32:44.2839781Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2839925Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2840003Z ) 2025-05-07T20:32:44.2840083Z else: 2025-05-07T20:32:44.2840183Z scale_ub_tensor = None 2025-05-07T20:32:44.2840262Z 2025-05-07T20:32:44.2840396Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2840499Z op = silu_mul_quant 2025-05-07T20:32:44.2840586Z if compiled: 2025-05-07T20:32:44.2840691Z op = torch.compile(op) 2025-05-07T20:32:44.2840805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2840880Z 2025-05-07T20:32:44.2840973Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.2841101Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.2841175Z 2025-05-07T20:32:44.2841317Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2841420Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.2841522Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.2841655Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.2841793Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2841872Z 2025-05-07T20:32:44.2842067Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:44.2842072Z 2025-05-07T20:32:44.2842175Z moe/activation_test.py:126: 2025-05-07T20:32:44.2842311Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2842494Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.2842630Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2843195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.2843298Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.2843659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2843893Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2844263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.2844526Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.2844898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.2845070Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.2845416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.2845495Z fn() 2025-05-07T20:32:44.2845893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.2845986Z self.fn.run( 2025-05-07T20:32:44.2846323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2846423Z kernel = self.compile( 2025-05-07T20:32:44.2846807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2846981Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2847118Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2847127Z 2025-05-07T20:32:44.2847335Z self = 2025-05-07T20:32:44.2848114Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2848617Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea1553a0>} 2025-05-07T20:32:44.2849405Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2849612Z context = 2025-05-07T20:32:44.2849621Z 2025-05-07T20:32:44.2849787Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2850055Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2850164Z module_map=module_map) 2025-05-07T20:32:44.2850326Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2850437Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.2850515Z E ^ 2025-05-07T20:32:44.2850877Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2850881Z 2025-05-07T20:32:44.2851374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2851379Z 2025-05-07T20:32:44.2851486Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2851716Z self=, 2025-05-07T20:32:44.2851896Z T=128, 2025-05-07T20:32:44.2851975Z D=5120, 2025-05-07T20:32:44.2852065Z scale_ub=None, 2025-05-07T20:32:44.2852153Z contiguous=True, 2025-05-07T20:32:44.2852243Z compiled=True, 2025-05-07T20:32:44.2852319Z ) 2025-05-07T20:32:44.2852535Z self = 2025-05-07T20:32:44.2852709Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.2852713Z 2025-05-07T20:32:44.2852792Z @given( 2025-05-07T20:32:44.2852913Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2853018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2853134Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2853256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2853378Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2853455Z ) 2025-05-07T20:32:44.2853710Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2853805Z def test_silu_mul_quant( 2025-05-07T20:32:44.2853884Z self, 2025-05-07T20:32:44.2853971Z T: int, 2025-05-07T20:32:44.2854049Z D: int, 2025-05-07T20:32:44.2854148Z scale_ub: Optional[float], 2025-05-07T20:32:44.2854245Z contiguous: bool, 2025-05-07T20:32:44.2854333Z compiled: bool, 2025-05-07T20:32:44.2854412Z ) -> None: 2025-05-07T20:32:44.2854518Z torch.manual_seed(2025) 2025-05-07T20:32:44.2854595Z 2025-05-07T20:32:44.2854763Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2854846Z 2025-05-07T20:32:44.2854939Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2855073Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2855168Z x = x_sign * x_clamp 2025-05-07T20:32:44.2855255Z x0 = x[:, :D] 2025-05-07T20:32:44.2855342Z x1 = x[:, D:] 2025-05-07T20:32:44.2855422Z 2025-05-07T20:32:44.2855507Z if contiguous: 2025-05-07T20:32:44.2855605Z x0 = x0.contiguous() 2025-05-07T20:32:44.2855694Z x1 = x1.contiguous() 2025-05-07T20:32:44.2855769Z 2025-05-07T20:32:44.2855872Z if scale_ub is not None: 2025-05-07T20:32:44.2855979Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2856115Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2856199Z ) 2025-05-07T20:32:44.2856281Z else: 2025-05-07T20:32:44.2856379Z scale_ub_tensor = None 2025-05-07T20:32:44.2856460Z 2025-05-07T20:32:44.2856593Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2856686Z op = silu_mul_quant 2025-05-07T20:32:44.2856784Z if compiled: 2025-05-07T20:32:44.2856885Z op = torch.compile(op) 2025-05-07T20:32:44.2857001Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2857080Z 2025-05-07T20:32:44.2857172Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.2857298Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.2857374Z 2025-05-07T20:32:44.2857510Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2857616Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.2857717Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.2857838Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.2857983Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2858061Z 2025-05-07T20:32:44.2858163Z > y_fp8_ref, 
Trying example: test_silu_mul_quant(
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

ref_fn() raises the identical CompilationError out of _kernel_quantize_fp8_row (traceback as above). The test body is the same for every retry, so from here on only the drawn parameters and the failure point are listed:

Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None, contiguous=True, compiled=True)   -> fails in ref_fn() at moe/activation_test.py:126 (_kernel_quantize_fp8_row)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)   -> fails in ref_fn() at moe/activation_test.py:126 (_kernel_quantize_fp8_row)
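The autotuner.py frames in the traceback above show why the error surfaces inside do_bench() rather than at import time: _kernel_quantize_fp8_row is an autotuned kernel, and Triton compiles each candidate config lazily, on the first launch, while benchmarking it. A toy sketch of that mechanism (hypothetical kernel and configs, not FBGEMM's):

import triton
import triton.language as tl

# Each config below is compiled and timed on the kernel's first launch; a
# kernel body that uses a dtype the GPU cannot encode therefore fails inside
# the autotuner's _bench() call, exactly as in the log above.
@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=4),
        triton.Config({"BLOCK": 256}, num_warps=8),
    ],
    key=["n"],
)
@triton.jit
def _double_rows(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * 2.0, mask=mask)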
Trying example: test_silu_mul_quant(
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

This example fails one step earlier, in fn() itself, while compiling the fbgemm_gpu kernel under torch.compile:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)    -> fails in ref_fn() at moe/activation_test.py:126 (_kernel_quantize_fp8_row)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)    -> fails in fn() at moe/activation_test.py:117 (_fbgemm_silu_mul_quant, called eagerly, so no torch._dynamo frame)
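For reference, what both failing kernels ultimately compute is ordinary row-wise FP8 quantization. Below is a rough eager-mode equivalent, assuming per-row symmetric scaling into the E4M3 range and a PyTorch recent enough to ship torch.float8_e4m3fn; the returned scale is the dequantization scale, matching the test's y_fp8.to(torch.float32) * y_scale[:, None]. This is a sketch of the idea, not FBGEMM's implementation:

from typing import Optional, Tuple

import torch

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row symmetric quantization: choose a scale so each row's max |value|
    # maps to the E4M3 maximum (448.0), optionally capping the row max first
    # (our assumption about how scale_ub is applied).
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / fp8_max
    y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    # Return the dequantization scale: y ~= y_fp8.to(float32) * scale[:, None].
    return y_fp8, scale.squeeze(-1)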
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None,   contiguous=False, compiled=True)   -> fails in fn() at moe/activation_test.py:117 (_fbgemm_silu_mul_quant)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)  -> fails in fn() at moe/activation_test.py:117 (_fbgemm_silu_mul_quant)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None,   contiguous=False, compiled=False)  -> fails in fn() at moe/activation_test.py:117 (_fbgemm_silu_mul_quant)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)  -> fails in fn() at moe/activation_test.py:117 (_fbgemm_silu_mul_quant)
Trying example: test_silu_mul_quant(T=1,   D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)   -> fails in fn() at moe/activation_test.py:117 (_fbgemm_silu_mul_quant); the captured log cuts off inside this final CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3016985Z 2025-05-07T20:32:44.3017402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3017411Z 2025-05-07T20:32:44.3017515Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3017739Z self=, 2025-05-07T20:32:44.3017823Z T=1, 2025-05-07T20:32:44.3017903Z D=7168, 2025-05-07T20:32:44.3017994Z scale_ub=1200.0, 2025-05-07T20:32:44.3018083Z contiguous=False, 2025-05-07T20:32:44.3018168Z compiled=True, 2025-05-07T20:32:44.3018251Z ) 2025-05-07T20:32:44.3018471Z self = 2025-05-07T20:32:44.3018644Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.3018649Z 2025-05-07T20:32:44.3018736Z @given( 2025-05-07T20:32:44.3018858Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3018966Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3019089Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3019207Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3019326Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3019403Z ) 2025-05-07T20:32:44.3019649Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3019748Z def test_silu_mul_quant( 2025-05-07T20:32:44.3019825Z self, 2025-05-07T20:32:44.3019904Z T: int, 2025-05-07T20:32:44.3019989Z D: int, 2025-05-07T20:32:44.3020090Z scale_ub: Optional[float], 2025-05-07T20:32:44.3020182Z contiguous: bool, 2025-05-07T20:32:44.3020277Z compiled: bool, 2025-05-07T20:32:44.3020360Z ) -> None: 2025-05-07T20:32:44.3020465Z torch.manual_seed(2025) 2025-05-07T20:32:44.3020542Z 2025-05-07T20:32:44.3020712Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3020797Z 2025-05-07T20:32:44.3020891Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3021019Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3021123Z x = x_sign * x_clamp 2025-05-07T20:32:44.3021206Z x0 = x[:, :D] 2025-05-07T20:32:44.3021288Z x1 = x[:, D:] 2025-05-07T20:32:44.3021372Z 2025-05-07T20:32:44.3021459Z if contiguous: 2025-05-07T20:32:44.3021555Z x0 = x0.contiguous() 2025-05-07T20:32:44.3021653Z x1 = x1.contiguous() 2025-05-07T20:32:44.3021729Z 2025-05-07T20:32:44.3021823Z if scale_ub is not None: 2025-05-07T20:32:44.3021937Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3022073Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3022159Z ) 2025-05-07T20:32:44.3022240Z else: 2025-05-07T20:32:44.3022338Z scale_ub_tensor = None 2025-05-07T20:32:44.3022417Z 2025-05-07T20:32:44.3022548Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3022646Z op = silu_mul_quant 2025-05-07T20:32:44.3022740Z if compiled: 2025-05-07T20:32:44.3022841Z op = torch.compile(op) 2025-05-07T20:32:44.3022949Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3023030Z 2025-05-07T20:32:44.3023122Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3023126Z 2025-05-07T20:32:44.3023235Z moe/activation_test.py:117: 2025-05-07T20:32:44.3023369Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3023472Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3023578Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3024035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3024134Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3024634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3024810Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3025173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3025397Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3025736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3025836Z kernel = self.compile( 2025-05-07T20:32:44.3026218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3026397Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3026530Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3026535Z 2025-05-07T20:32:44.3026741Z self = 2025-05-07T20:32:44.3027534Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3028038Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8bdd0e00>} 2025-05-07T20:32:44.3028786Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3028981Z context = 2025-05-07T20:32:44.3028985Z 2025-05-07T20:32:44.3029153Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3029420Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3029535Z module_map=module_map) 2025-05-07T20:32:44.3029704Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3029810Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3029890Z E ^ 2025-05-07T20:32:44.3030247Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3030252Z 2025-05-07T20:32:44.3030664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3030669Z 2025-05-07T20:32:44.3030777Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3031010Z self=, 2025-05-07T20:32:44.3031089Z T=1, 2025-05-07T20:32:44.3031172Z D=7168, 2025-05-07T20:32:44.3031257Z scale_ub=None, 2025-05-07T20:32:44.3031349Z contiguous=False, 2025-05-07T20:32:44.3031439Z compiled=True, 2025-05-07T20:32:44.3031514Z ) 2025-05-07T20:32:44.3031733Z self = 2025-05-07T20:32:44.3031905Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.3031910Z 2025-05-07T20:32:44.3031988Z @given( 2025-05-07T20:32:44.3032109Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3032219Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3032336Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3032459Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3032575Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3032732Z ) 2025-05-07T20:32:44.3032983Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3033079Z def test_silu_mul_quant( 2025-05-07T20:32:44.3033237Z self, 2025-05-07T20:32:44.3033321Z T: int, 2025-05-07T20:32:44.3033401Z D: int, 2025-05-07T20:32:44.3033503Z scale_ub: Optional[float], 2025-05-07T20:32:44.3033601Z contiguous: bool, 2025-05-07T20:32:44.3033689Z compiled: bool, 2025-05-07T20:32:44.3033771Z ) -> None: 2025-05-07T20:32:44.3033874Z torch.manual_seed(2025) 2025-05-07T20:32:44.3033950Z 2025-05-07T20:32:44.3034126Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3034202Z 2025-05-07T20:32:44.3034296Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3034425Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3034515Z x = x_sign * x_clamp 2025-05-07T20:32:44.3034597Z x0 = x[:, :D] 2025-05-07T20:32:44.3034691Z x1 = x[:, D:] 2025-05-07T20:32:44.3034766Z 2025-05-07T20:32:44.3034853Z if contiguous: 2025-05-07T20:32:44.3034953Z x0 = x0.contiguous() 2025-05-07T20:32:44.3035051Z x1 = x1.contiguous() 2025-05-07T20:32:44.3035127Z 2025-05-07T20:32:44.3035228Z if scale_ub is not None: 2025-05-07T20:32:44.3035340Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3035481Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3035558Z ) 2025-05-07T20:32:44.3035635Z else: 2025-05-07T20:32:44.3035803Z scale_ub_tensor = None 2025-05-07T20:32:44.3035880Z 2025-05-07T20:32:44.3036010Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3036108Z op = silu_mul_quant 2025-05-07T20:32:44.3036193Z if compiled: 2025-05-07T20:32:44.3036296Z op = torch.compile(op) 2025-05-07T20:32:44.3036412Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3036488Z 2025-05-07T20:32:44.3036581Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.3036711Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.3036790Z 2025-05-07T20:32:44.3036928Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3037034Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.3037134Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.3037264Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.3037403Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.3037479Z 2025-05-07T20:32:44.3037584Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:44.3037588Z 2025-05-07T20:32:44.3037689Z moe/activation_test.py:126: 2025-05-07T20:32:44.3037820Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3037931Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.3038073Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.3038633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.3038740Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.3039102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3039331Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3039696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.3039956Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.3040328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.3040580Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.3040928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.3041088Z fn() 2025-05-07T20:32:44.3041491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.3041579Z self.fn.run( 2025-05-07T20:32:44.3041915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3042013Z kernel = self.compile( 2025-05-07T20:32:44.3042394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3042570Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3042703Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3042707Z 2025-05-07T20:32:44.3042920Z self = 2025-05-07T20:32:44.3043705Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3044212Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be858e8e0>} 2025-05-07T20:32:44.3044958Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3045156Z context = 2025-05-07T20:32:44.3045160Z 2025-05-07T20:32:44.3045333Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3045601Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3045712Z module_map=module_map) 2025-05-07T20:32:44.3045883Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3045992Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.3046071Z E ^ 2025-05-07T20:32:44.3046426Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3046434Z 2025-05-07T20:32:44.3046844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3046848Z 2025-05-07T20:32:44.3046952Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3047178Z self=, 2025-05-07T20:32:44.3047257Z T=1, 2025-05-07T20:32:44.3047335Z D=5120, 2025-05-07T20:32:44.3047427Z scale_ub=1200.0, 2025-05-07T20:32:44.3047516Z contiguous=False, 2025-05-07T20:32:44.3047600Z compiled=True, 2025-05-07T20:32:44.3047679Z ) 2025-05-07T20:32:44.3047902Z self = 2025-05-07T20:32:44.3048072Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.3048076Z 2025-05-07T20:32:44.3048153Z @given( 2025-05-07T20:32:44.3048273Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3048377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3048492Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3048611Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3048728Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3048803Z ) 2025-05-07T20:32:44.3049049Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3049252Z def test_silu_mul_quant( 2025-05-07T20:32:44.3049330Z self, 2025-05-07T20:32:44.3049411Z T: int, 2025-05-07T20:32:44.3049489Z D: int, 2025-05-07T20:32:44.3049588Z scale_ub: Optional[float], 2025-05-07T20:32:44.3049754Z contiguous: bool, 2025-05-07T20:32:44.3049847Z compiled: bool, 2025-05-07T20:32:44.3049926Z ) -> None: 2025-05-07T20:32:44.3050027Z torch.manual_seed(2025) 2025-05-07T20:32:44.3050102Z 2025-05-07T20:32:44.3050269Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3050348Z 2025-05-07T20:32:44.3050441Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3050564Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3050657Z x = x_sign * x_clamp 2025-05-07T20:32:44.3050739Z x0 = x[:, :D] 2025-05-07T20:32:44.3050824Z x1 = x[:, D:] 2025-05-07T20:32:44.3050899Z 2025-05-07T20:32:44.3050985Z if contiguous: 2025-05-07T20:32:44.3051088Z x0 = x0.contiguous() 2025-05-07T20:32:44.3051179Z x1 = x1.contiguous() 2025-05-07T20:32:44.3051253Z 2025-05-07T20:32:44.3051349Z if scale_ub is not None: 2025-05-07T20:32:44.3051462Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3051597Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3051682Z ) 2025-05-07T20:32:44.3051759Z else: 2025-05-07T20:32:44.3051854Z scale_ub_tensor = None 2025-05-07T20:32:44.3051930Z 2025-05-07T20:32:44.3052063Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3052154Z op = silu_mul_quant 2025-05-07T20:32:44.3052244Z if compiled: 2025-05-07T20:32:44.3052343Z op = torch.compile(op) 2025-05-07T20:32:44.3052457Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3052530Z 2025-05-07T20:32:44.3052623Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3052628Z 2025-05-07T20:32:44.3052734Z moe/activation_test.py:117: 2025-05-07T20:32:44.3052862Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3052967Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3053076Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3053448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3053541Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3054032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3054133Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3054489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3054713Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3055060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3055156Z kernel = self.compile( 2025-05-07T20:32:44.3055539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3055716Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3055847Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3055852Z 2025-05-07T20:32:44.3056054Z self = 2025-05-07T20:32:44.3056829Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3057421Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be858c540>} 2025-05-07T20:32:44.3058167Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3058438Z context = 2025-05-07T20:32:44.3058443Z 2025-05-07T20:32:44.3058607Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3058877Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3058986Z module_map=module_map) 2025-05-07T20:32:44.3059147Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3059250Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3059330Z E ^ 2025-05-07T20:32:44.3059690Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3059695Z 2025-05-07T20:32:44.3060111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3060123Z 2025-05-07T20:32:44.3060229Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3060457Z self=, 2025-05-07T20:32:44.3060539Z T=1, 2025-05-07T20:32:44.3060619Z D=5120, 2025-05-07T20:32:44.3060709Z scale_ub=1200.0, 2025-05-07T20:32:44.3060800Z contiguous=False, 2025-05-07T20:32:44.3060886Z compiled=False, 2025-05-07T20:32:44.3060965Z ) 2025-05-07T20:32:44.3061184Z self = 2025-05-07T20:32:44.3061352Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.3061356Z 2025-05-07T20:32:44.3061438Z @given( 2025-05-07T20:32:44.3061563Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3061673Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3061791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3061915Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3062038Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3062114Z ) 2025-05-07T20:32:44.3062359Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3062457Z def test_silu_mul_quant( 2025-05-07T20:32:44.3062538Z self, 2025-05-07T20:32:44.3062619Z T: int, 2025-05-07T20:32:44.3062701Z D: int, 2025-05-07T20:32:44.3062801Z scale_ub: Optional[float], 2025-05-07T20:32:44.3062897Z contiguous: bool, 2025-05-07T20:32:44.3062984Z compiled: bool, 2025-05-07T20:32:44.3063067Z ) -> None: 2025-05-07T20:32:44.3063167Z torch.manual_seed(2025) 2025-05-07T20:32:44.3063242Z 2025-05-07T20:32:44.3063417Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3063501Z 2025-05-07T20:32:44.3063594Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3063726Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3063820Z x = x_sign * x_clamp 2025-05-07T20:32:44.3063900Z x0 = x[:, :D] 2025-05-07T20:32:44.3063981Z x1 = x[:, D:] 2025-05-07T20:32:44.3064060Z 2025-05-07T20:32:44.3064147Z if contiguous: 2025-05-07T20:32:44.3064239Z x0 = x0.contiguous() 2025-05-07T20:32:44.3064331Z x1 = x1.contiguous() 2025-05-07T20:32:44.3064407Z 2025-05-07T20:32:44.3064502Z if scale_ub is not None: 2025-05-07T20:32:44.3064612Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3064747Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3064827Z ) 2025-05-07T20:32:44.3064904Z else: 2025-05-07T20:32:44.3065082Z scale_ub_tensor = None 2025-05-07T20:32:44.3065160Z 2025-05-07T20:32:44.3065292Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3065583Z op = silu_mul_quant 2025-05-07T20:32:44.3065815Z if compiled: 2025-05-07T20:32:44.3065925Z op = torch.compile(op) 2025-05-07T20:32:44.3066029Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3066106Z 2025-05-07T20:32:44.3066196Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3066201Z 2025-05-07T20:32:44.3066302Z moe/activation_test.py:117: 2025-05-07T20:32:44.3066429Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3066529Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3066632Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3067127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3067228Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3067587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3067812Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3068153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3068244Z kernel = self.compile( 2025-05-07T20:32:44.3068625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3068802Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3068926Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3068930Z 2025-05-07T20:32:44.3069139Z self = 2025-05-07T20:32:44.3069918Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3070422Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be8b5e0c0>} 2025-05-07T20:32:44.3071168Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3071356Z context = 2025-05-07T20:32:44.3071361Z 2025-05-07T20:32:44.3071528Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3071788Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3071900Z module_map=module_map) 2025-05-07T20:32:44.3072071Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3072172Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3072260Z E ^ 2025-05-07T20:32:44.3072613Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3072618Z 2025-05-07T20:32:44.3073025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3073030Z 2025-05-07T20:32:44.3073137Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3073361Z self=, 2025-05-07T20:32:44.3073440Z T=16384, 2025-05-07T20:32:44.3073523Z D=5120, 2025-05-07T20:32:44.3073608Z scale_ub=1200.0, 2025-05-07T20:32:44.3073699Z contiguous=False, 2025-05-07T20:32:44.3073926Z compiled=True, 2025-05-07T20:32:44.3074005Z ) 2025-05-07T20:32:44.3074229Z self = 2025-05-07T20:32:44.3074407Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.3074546Z 2025-05-07T20:32:44.3074625Z @given( 2025-05-07T20:32:44.3074748Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3074847Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3074962Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3075080Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3075194Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3075272Z ) 2025-05-07T20:32:44.3075515Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3075609Z def test_silu_mul_quant( 2025-05-07T20:32:44.3075693Z self, 2025-05-07T20:32:44.3075823Z T: int, 2025-05-07T20:32:44.3075906Z D: int, 2025-05-07T20:32:44.3076009Z scale_ub: Optional[float], 2025-05-07T20:32:44.3076099Z contiguous: bool, 2025-05-07T20:32:44.3076186Z compiled: bool, 2025-05-07T20:32:44.3076276Z ) -> None: 2025-05-07T20:32:44.3076376Z torch.manual_seed(2025) 2025-05-07T20:32:44.3076450Z 2025-05-07T20:32:44.3076621Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3076699Z 2025-05-07T20:32:44.3076797Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3076920Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3077008Z x = x_sign * x_clamp 2025-05-07T20:32:44.3077098Z x0 = x[:, :D] 2025-05-07T20:32:44.3077180Z x1 = x[:, D:] 2025-05-07T20:32:44.3077254Z 2025-05-07T20:32:44.3077340Z if contiguous: 2025-05-07T20:32:44.3077432Z x0 = x0.contiguous() 2025-05-07T20:32:44.3077522Z x1 = x1.contiguous() 2025-05-07T20:32:44.3077602Z 2025-05-07T20:32:44.3077694Z if scale_ub is not None: 2025-05-07T20:32:44.3077801Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3077940Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3078024Z ) 2025-05-07T20:32:44.3078106Z else: 2025-05-07T20:32:44.3078203Z scale_ub_tensor = None 2025-05-07T20:32:44.3078277Z 2025-05-07T20:32:44.3078408Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3078500Z op = silu_mul_quant 2025-05-07T20:32:44.3078586Z if compiled: 2025-05-07T20:32:44.3078689Z op = torch.compile(op) 2025-05-07T20:32:44.3078795Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3078870Z 2025-05-07T20:32:44.3078964Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3078968Z 2025-05-07T20:32:44.3079066Z moe/activation_test.py:117: 2025-05-07T20:32:44.3079200Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3079307Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3079407Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3079776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3079875Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3080364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3080466Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3080820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3081040Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3081379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3081601Z kernel = self.compile( 2025-05-07T20:32:44.3081985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3082158Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3082362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3082367Z 2025-05-07T20:32:44.3082573Z self = 2025-05-07T20:32:44.3083346Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3083850Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be90eb560>} 2025-05-07T20:32:44.3084598Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3084794Z context = 2025-05-07T20:32:44.3084799Z 2025-05-07T20:32:44.3084962Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3085224Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3085335Z module_map=module_map) 2025-05-07T20:32:44.3085497Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3085597Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3085681Z E ^ 2025-05-07T20:32:44.3086035Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3086040Z 2025-05-07T20:32:44.3086455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3086460Z 2025-05-07T20:32:44.3086565Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3086793Z self=, 2025-05-07T20:32:44.3086873Z T=2048, 2025-05-07T20:32:44.3086949Z D=7168, 2025-05-07T20:32:44.3087032Z scale_ub=1200.0, 2025-05-07T20:32:44.3087123Z contiguous=False, 2025-05-07T20:32:44.3087206Z compiled=True, 2025-05-07T20:32:44.3087283Z ) 2025-05-07T20:32:44.3087498Z self = 2025-05-07T20:32:44.3087672Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.3087677Z 2025-05-07T20:32:44.3087757Z @given( 2025-05-07T20:32:44.3087874Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3087975Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3088095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3088211Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3088326Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3088405Z ) 2025-05-07T20:32:44.3088647Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3088743Z def test_silu_mul_quant( 2025-05-07T20:32:44.3088820Z self, 2025-05-07T20:32:44.3088896Z T: int, 2025-05-07T20:32:44.3088975Z D: int, 2025-05-07T20:32:44.3089072Z scale_ub: Optional[float], 2025-05-07T20:32:44.3089159Z contiguous: bool, 2025-05-07T20:32:44.3089248Z compiled: bool, 2025-05-07T20:32:44.3089326Z ) -> None: 2025-05-07T20:32:44.3089421Z torch.manual_seed(2025) 2025-05-07T20:32:44.3089496Z 2025-05-07T20:32:44.3089661Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3089820Z 2025-05-07T20:32:44.3089915Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3090039Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3090133Z x = x_sign * x_clamp 2025-05-07T20:32:44.3090287Z x0 = x[:, :D] 2025-05-07T20:32:44.3090366Z x1 = x[:, D:] 2025-05-07T20:32:44.3090442Z 2025-05-07T20:32:44.3090524Z if contiguous: 2025-05-07T20:32:44.3090617Z x0 = x0.contiguous() 2025-05-07T20:32:44.3090708Z x1 = x1.contiguous() 2025-05-07T20:32:44.3090780Z 2025-05-07T20:32:44.3090870Z if scale_ub is not None: 2025-05-07T20:32:44.3090979Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3091110Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3091184Z ) 2025-05-07T20:32:44.3091264Z else: 2025-05-07T20:32:44.3091358Z scale_ub_tensor = None 2025-05-07T20:32:44.3091431Z 2025-05-07T20:32:44.3091569Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3091662Z op = silu_mul_quant 2025-05-07T20:32:44.3091750Z if compiled: 2025-05-07T20:32:44.3091848Z op = torch.compile(op) 2025-05-07T20:32:44.3091959Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3092034Z 2025-05-07T20:32:44.3092127Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3092132Z 2025-05-07T20:32:44.3092228Z moe/activation_test.py:117: 2025-05-07T20:32:44.3092359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3092459Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3092559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3092927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3093018Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3093513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3093610Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3093964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3094191Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3094526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3094620Z kernel = self.compile( 2025-05-07T20:32:44.3094997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3095170Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3095299Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3095303Z 2025-05-07T20:32:44.3095511Z self = 2025-05-07T20:32:44.3096294Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3096800Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be8d6fd80>} 2025-05-07T20:32:44.3097543Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3097736Z context = 2025-05-07T20:32:44.3097740Z 2025-05-07T20:32:44.3097902Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3098247Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3098357Z module_map=module_map) 2025-05-07T20:32:44.3098517Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3098694Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3098771Z E ^ 2025-05-07T20:32:44.3099122Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3099130Z 2025-05-07T20:32:44.3099540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3099544Z 2025-05-07T20:32:44.3099647Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3099871Z self=, 2025-05-07T20:32:44.3099948Z T=1, 2025-05-07T20:32:44.3100023Z D=5120, 2025-05-07T20:32:44.3100110Z scale_ub=None, 2025-05-07T20:32:44.3100201Z contiguous=False, 2025-05-07T20:32:44.3100285Z compiled=False, 2025-05-07T20:32:44.3100363Z ) 2025-05-07T20:32:44.3100579Z self = 2025-05-07T20:32:44.3100752Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.3100757Z 2025-05-07T20:32:44.3100836Z @given( 2025-05-07T20:32:44.3100954Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3101058Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3101173Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3101288Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3101406Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3101480Z ) 2025-05-07T20:32:44.3101721Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3101817Z def test_silu_mul_quant( 2025-05-07T20:32:44.3101897Z self, 2025-05-07T20:32:44.3101977Z T: int, 2025-05-07T20:32:44.3102053Z D: int, 2025-05-07T20:32:44.3102152Z scale_ub: Optional[float], 2025-05-07T20:32:44.3102244Z contiguous: bool, 2025-05-07T20:32:44.3102334Z compiled: bool, 2025-05-07T20:32:44.3102411Z ) -> None: 2025-05-07T20:32:44.3102509Z torch.manual_seed(2025) 2025-05-07T20:32:44.3102582Z 2025-05-07T20:32:44.3102747Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3102825Z 2025-05-07T20:32:44.3102914Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3103039Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3103132Z x = x_sign * x_clamp 2025-05-07T20:32:44.3103212Z x0 = x[:, :D] 2025-05-07T20:32:44.3103296Z x1 = x[:, D:] 2025-05-07T20:32:44.3103369Z 2025-05-07T20:32:44.3103453Z if contiguous: 2025-05-07T20:32:44.3103546Z x0 = x0.contiguous() 2025-05-07T20:32:44.3103641Z x1 = x1.contiguous() 2025-05-07T20:32:44.3103713Z 2025-05-07T20:32:44.3103806Z if scale_ub is not None: 2025-05-07T20:32:44.3103910Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3104049Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3104128Z ) 2025-05-07T20:32:44.3104205Z else: 2025-05-07T20:32:44.3104299Z scale_ub_tensor = None 2025-05-07T20:32:44.3104377Z 2025-05-07T20:32:44.3104503Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3104591Z op = silu_mul_quant 2025-05-07T20:32:44.3104679Z if compiled: 2025-05-07T20:32:44.3104777Z op = torch.compile(op) 2025-05-07T20:32:44.3104887Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3104960Z 2025-05-07T20:32:44.3105049Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3105053Z 2025-05-07T20:32:44.3105151Z moe/activation_test.py:117: 2025-05-07T20:32:44.3105361Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3105462Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3105562Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3106156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3106254Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3106610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3106830Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3107170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3107263Z kernel = self.compile( 2025-05-07T20:32:44.3107647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3107821Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3107950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3107959Z 2025-05-07T20:32:44.3108163Z self = 2025-05-07T20:32:44.3108941Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3109490Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be976bd80>} 2025-05-07T20:32:44.3110242Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3110431Z context = 2025-05-07T20:32:44.3110435Z 2025-05-07T20:32:44.3110607Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3110867Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3110973Z module_map=module_map) 2025-05-07T20:32:44.3111137Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3111237Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3111317Z E ^ 2025-05-07T20:32:44.3111668Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3111673Z 2025-05-07T20:32:44.3112082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3112091Z 2025-05-07T20:32:44.3112199Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3112421Z self=, 2025-05-07T20:32:44.3112503Z T=4096, 2025-05-07T20:32:44.3112582Z D=7168, 2025-05-07T20:32:44.3112664Z scale_ub=1200.0, 2025-05-07T20:32:44.3112755Z contiguous=False, 2025-05-07T20:32:44.3112838Z compiled=False, 2025-05-07T20:32:44.3112911Z ) 2025-05-07T20:32:44.3113130Z self = 2025-05-07T20:32:44.3113306Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.3113310Z 2025-05-07T20:32:44.3113386Z @given( 2025-05-07T20:32:44.3113507Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3113605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3113719Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3113916Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3114031Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3114108Z ) 2025-05-07T20:32:44.3114349Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3114518Z def test_silu_mul_quant( 2025-05-07T20:32:44.3114598Z self, 2025-05-07T20:32:44.3114674Z T: int, 2025-05-07T20:32:44.3114749Z D: int, 2025-05-07T20:32:44.3114852Z scale_ub: Optional[float], 2025-05-07T20:32:44.3114940Z contiguous: bool, 2025-05-07T20:32:44.3115024Z compiled: bool, 2025-05-07T20:32:44.3115106Z ) -> None: 2025-05-07T20:32:44.3115201Z torch.manual_seed(2025) 2025-05-07T20:32:44.3115274Z 2025-05-07T20:32:44.3115443Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3115516Z 2025-05-07T20:32:44.3115611Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3115784Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3115873Z x = x_sign * x_clamp 2025-05-07T20:32:44.3115954Z x0 = x[:, :D] 2025-05-07T20:32:44.3116032Z x1 = x[:, D:] 2025-05-07T20:32:44.3116111Z 2025-05-07T20:32:44.3116196Z if contiguous: 2025-05-07T20:32:44.3116286Z x0 = x0.contiguous() 2025-05-07T20:32:44.3116374Z x1 = x1.contiguous() 2025-05-07T20:32:44.3116449Z 2025-05-07T20:32:44.3116539Z if scale_ub is not None: 2025-05-07T20:32:44.3116643Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3116779Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3116854Z ) 2025-05-07T20:32:44.3116932Z else: 2025-05-07T20:32:44.3117026Z scale_ub_tensor = None 2025-05-07T20:32:44.3117100Z 2025-05-07T20:32:44.3117233Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3117322Z op = silu_mul_quant 2025-05-07T20:32:44.3117411Z if compiled: 2025-05-07T20:32:44.3117513Z op = torch.compile(op) 2025-05-07T20:32:44.3117619Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3117691Z 2025-05-07T20:32:44.3117789Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3117793Z 2025-05-07T20:32:44.3117888Z moe/activation_test.py:117: 2025-05-07T20:32:44.3118014Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3118116Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3118216Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3118713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.3118807Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3119161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3119392Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3119731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3119826Z kernel = self.compile( 2025-05-07T20:32:44.3120205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3120379Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3120506Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3120510Z 2025-05-07T20:32:44.3120712Z self = 2025-05-07T20:32:44.3121484Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3122067Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be93a0e00>} 2025-05-07T20:32:44.3122810Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3123075Z context = 2025-05-07T20:32:44.3123080Z 2025-05-07T20:32:44.3123243Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3123505Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3123612Z module_map=module_map) 2025-05-07T20:32:44.3123773Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3123878Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3123961Z E ^ 2025-05-07T20:32:44.3124313Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3124318Z 2025-05-07T20:32:44.3124733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3124738Z 2025-05-07T20:32:44.3124840Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3125064Z self=, 2025-05-07T20:32:44.3125143Z T=16384, 2025-05-07T20:32:44.3125221Z D=7168, 2025-05-07T20:32:44.3125308Z scale_ub=None, 2025-05-07T20:32:44.3125393Z contiguous=True, 2025-05-07T20:32:44.3125478Z compiled=True, 2025-05-07T20:32:44.3125555Z ) 2025-05-07T20:32:44.3125772Z self = 2025-05-07T20:32:44.3125946Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.3125955Z 2025-05-07T20:32:44.3126034Z @given( 2025-05-07T20:32:44.3126154Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3126256Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3126377Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3126494Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3126610Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3126686Z ) 2025-05-07T20:32:44.3126927Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3127024Z def test_silu_mul_quant( 2025-05-07T20:32:44.3127101Z self, 2025-05-07T20:32:44.3127185Z T: int, 2025-05-07T20:32:44.3127262Z D: int, 2025-05-07T20:32:44.3127360Z scale_ub: Optional[float], 2025-05-07T20:32:44.3127453Z contiguous: bool, 2025-05-07T20:32:44.3127538Z compiled: bool, 2025-05-07T20:32:44.3127617Z ) -> None: 2025-05-07T20:32:44.3127724Z torch.manual_seed(2025) 2025-05-07T20:32:44.3127799Z 2025-05-07T20:32:44.3127966Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3128049Z 2025-05-07T20:32:44.3128142Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3128266Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3128364Z x = x_sign * x_clamp 2025-05-07T20:32:44.3128446Z x0 = x[:, :D] 2025-05-07T20:32:44.3128527Z x1 = x[:, D:] 2025-05-07T20:32:44.3128607Z 2025-05-07T20:32:44.3128692Z if contiguous: 2025-05-07T20:32:44.3128786Z x0 = x0.contiguous() 2025-05-07T20:32:44.3128875Z x1 = x1.contiguous() 2025-05-07T20:32:44.3128949Z 2025-05-07T20:32:44.3129043Z if scale_ub is not None: 2025-05-07T20:32:44.3129148Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3129282Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3129471Z ) 2025-05-07T20:32:44.3129564Z else: 2025-05-07T20:32:44.3129674Z scale_ub_tensor = None 2025-05-07T20:32:44.3129751Z 2025-05-07T20:32:44.3129880Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3130045Z op = silu_mul_quant 2025-05-07T20:32:44.3130134Z if compiled: 2025-05-07T20:32:44.3130239Z op = torch.compile(op) 2025-05-07T20:32:44.3130345Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3130423Z 2025-05-07T20:32:44.3134193Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3134200Z 2025-05-07T20:32:44.3134319Z moe/activation_test.py:117: 2025-05-07T20:32:44.3134451Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3134559Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3134664Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3135051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3135151Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3135643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3135747Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3136106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3136328Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3136669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3136764Z kernel = self.compile( 2025-05-07T20:32:44.3137146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3137324Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3137456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3137460Z 2025-05-07T20:32:44.3137671Z self = 2025-05-07T20:32:44.3138452Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3138956Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea4d8fe0>} 2025-05-07T20:32:44.3139709Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3139906Z context = 2025-05-07T20:32:44.3139910Z 2025-05-07T20:32:44.3140080Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3140344Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3140462Z module_map=module_map) 2025-05-07T20:32:44.3140627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3140730Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3140808Z E ^ 2025-05-07T20:32:44.3141167Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3141172Z 2025-05-07T20:32:44.3141583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3141588Z 2025-05-07T20:32:44.3141695Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3142044Z self=, 2025-05-07T20:32:44.3142130Z T=4096, 2025-05-07T20:32:44.3142209Z D=5120, 2025-05-07T20:32:44.3142292Z scale_ub=None, 2025-05-07T20:32:44.3142457Z contiguous=False, 2025-05-07T20:32:44.3142540Z compiled=True, 2025-05-07T20:32:44.3142618Z ) 2025-05-07T20:32:44.3142840Z self = 2025-05-07T20:32:44.3143013Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.3143018Z 2025-05-07T20:32:44.3143100Z @given( 2025-05-07T20:32:44.3143220Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3143320Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3143439Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3143556Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3143670Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3143754Z ) 2025-05-07T20:32:44.3144379Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3144477Z def test_silu_mul_quant( 2025-05-07T20:32:44.3144561Z self, 2025-05-07T20:32:44.3144640Z T: int, 2025-05-07T20:32:44.3144721Z D: int, 2025-05-07T20:32:44.3144820Z scale_ub: Optional[float], 2025-05-07T20:32:44.3144912Z contiguous: bool, 2025-05-07T20:32:44.3145004Z compiled: bool, 2025-05-07T20:32:44.3145086Z ) -> None: 2025-05-07T20:32:44.3145181Z torch.manual_seed(2025) 2025-05-07T20:32:44.3145262Z 2025-05-07T20:32:44.3145429Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3145512Z 2025-05-07T20:32:44.3145605Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3145731Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3145826Z x = x_sign * x_clamp 2025-05-07T20:32:44.3145910Z x0 = x[:, :D] 2025-05-07T20:32:44.3145995Z x1 = x[:, D:] 2025-05-07T20:32:44.3146071Z 2025-05-07T20:32:44.3146158Z if contiguous: 2025-05-07T20:32:44.3146251Z x0 = x0.contiguous() 2025-05-07T20:32:44.3146348Z x1 = x1.contiguous() 2025-05-07T20:32:44.3146422Z 2025-05-07T20:32:44.3146514Z if scale_ub is not None: 2025-05-07T20:32:44.3146630Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3146766Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3146842Z ) 2025-05-07T20:32:44.3146926Z else: 2025-05-07T20:32:44.3147021Z scale_ub_tensor = None 2025-05-07T20:32:44.3147101Z 2025-05-07T20:32:44.3147231Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3147325Z op = silu_mul_quant 2025-05-07T20:32:44.3147415Z if compiled: 2025-05-07T20:32:44.3147516Z op = torch.compile(op) 2025-05-07T20:32:44.3147626Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3147706Z 2025-05-07T20:32:44.3147800Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3147805Z 2025-05-07T20:32:44.3147903Z moe/activation_test.py:117: 2025-05-07T20:32:44.3148040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3148143Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3148246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3148618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3148712Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3149205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3149303Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3149658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3149971Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3150312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3150483Z kernel = self.compile( 2025-05-07T20:32:44.3150864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3151039Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3151171Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3151176Z 2025-05-07T20:32:44.3151383Z self = 2025-05-07T20:32:44.3152168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3152672Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea4afb00>} 2025-05-07T20:32:44.3153419Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3153616Z context = 2025-05-07T20:32:44.3153620Z 2025-05-07T20:32:44.3153786Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3154052Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3154160Z module_map=module_map) 2025-05-07T20:32:44.3154321Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3154427Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3154506Z E ^ 2025-05-07T20:32:44.3154864Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3154873Z 2025-05-07T20:32:44.3155284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3155289Z 2025-05-07T20:32:44.3155393Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3155621Z self=, 2025-05-07T20:32:44.3155780Z T=4096, 2025-05-07T20:32:44.3155860Z D=5120, 2025-05-07T20:32:44.3155947Z scale_ub=1200.0, 2025-05-07T20:32:44.3156034Z contiguous=False, 2025-05-07T20:32:44.3156123Z compiled=False, 2025-05-07T20:32:44.3156203Z ) 2025-05-07T20:32:44.3156420Z self = 2025-05-07T20:32:44.3156604Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.3156609Z 2025-05-07T20:32:44.3156688Z @given( 2025-05-07T20:32:44.3156808Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3156916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3157032Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3157149Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3157266Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3157342Z ) 2025-05-07T20:32:44.3157592Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3157686Z def test_silu_mul_quant( 2025-05-07T20:32:44.3157765Z self, 2025-05-07T20:32:44.3157848Z T: int, 2025-05-07T20:32:44.3157926Z D: int, 2025-05-07T20:32:44.3158026Z scale_ub: Optional[float], 2025-05-07T20:32:44.3158120Z contiguous: bool, 2025-05-07T20:32:44.3158289Z compiled: bool, 2025-05-07T20:32:44.3158369Z ) -> None: 2025-05-07T20:32:44.3158470Z torch.manual_seed(2025) 2025-05-07T20:32:44.3158544Z 2025-05-07T20:32:44.3158716Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3158868Z 2025-05-07T20:32:44.3158962Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3159094Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3159188Z x = x_sign * x_clamp 2025-05-07T20:32:44.3159270Z x0 = x[:, :D] 2025-05-07T20:32:44.3159355Z x1 = x[:, D:] 2025-05-07T20:32:44.3159428Z 2025-05-07T20:32:44.3159512Z if contiguous: 2025-05-07T20:32:44.3159608Z x0 = x0.contiguous() 2025-05-07T20:32:44.3159697Z x1 = x1.contiguous() 2025-05-07T20:32:44.3159771Z 2025-05-07T20:32:44.3159866Z if scale_ub is not None: 2025-05-07T20:32:44.3159972Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3160114Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3160196Z ) 2025-05-07T20:32:44.3160274Z else: 2025-05-07T20:32:44.3160371Z scale_ub_tensor = None 2025-05-07T20:32:44.3160455Z 2025-05-07T20:32:44.3160586Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3160679Z op = silu_mul_quant 2025-05-07T20:32:44.3160764Z if compiled: 2025-05-07T20:32:44.3160864Z op = torch.compile(op) 2025-05-07T20:32:44.3160975Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3161048Z 2025-05-07T20:32:44.3161140Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3161144Z 2025-05-07T20:32:44.3161245Z moe/activation_test.py:117: 2025-05-07T20:32:44.3161374Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3161475Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3161579Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3162081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.3162183Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3162547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3162770Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3163112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3163206Z kernel = self.compile( 2025-05-07T20:32:44.3163589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3163763Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3163894Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3163902Z 2025-05-07T20:32:44.3164110Z self = 2025-05-07T20:32:44.3164889Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3165759Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea4ad3a0>} 2025-05-07T20:32:44.3166516Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3166707Z context = 2025-05-07T20:32:44.3166713Z 2025-05-07T20:32:44.3167028Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3167296Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3167409Z module_map=module_map) 2025-05-07T20:32:44.3167683Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3167785Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3167869Z E ^ 2025-05-07T20:32:44.3168223Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3168228Z 2025-05-07T20:32:44.3168639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3168647Z 2025-05-07T20:32:44.3168751Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3168977Z self=, 2025-05-07T20:32:44.3169063Z T=4096, 2025-05-07T20:32:44.3169145Z D=5120, 2025-05-07T20:32:44.3169232Z scale_ub=1200.0, 2025-05-07T20:32:44.3169323Z contiguous=False, 2025-05-07T20:32:44.3169406Z compiled=True, 2025-05-07T20:32:44.3169481Z ) 2025-05-07T20:32:44.3169707Z self = 2025-05-07T20:32:44.3169882Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.3169887Z 2025-05-07T20:32:44.3169968Z @given( 2025-05-07T20:32:44.3170089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3170190Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3170312Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3170429Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3170542Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3170620Z ) 2025-05-07T20:32:44.3170863Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3170961Z def test_silu_mul_quant( 2025-05-07T20:32:44.3171040Z self, 2025-05-07T20:32:44.3171118Z T: int, 2025-05-07T20:32:44.3171197Z D: int, 2025-05-07T20:32:44.3171307Z scale_ub: Optional[float], 2025-05-07T20:32:44.3171396Z contiguous: bool, 2025-05-07T20:32:44.3171485Z compiled: bool, 2025-05-07T20:32:44.3171564Z ) -> None: 2025-05-07T20:32:44.3171664Z torch.manual_seed(2025) 2025-05-07T20:32:44.3171741Z 2025-05-07T20:32:44.3171909Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3171984Z 2025-05-07T20:32:44.3172080Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3172203Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3172290Z x = x_sign * x_clamp 2025-05-07T20:32:44.3172372Z x0 = x[:, :D] 2025-05-07T20:32:44.3172451Z x1 = x[:, D:] 2025-05-07T20:32:44.3172525Z 2025-05-07T20:32:44.3172617Z if contiguous: 2025-05-07T20:32:44.3172708Z x0 = x0.contiguous() 2025-05-07T20:32:44.3172797Z x1 = x1.contiguous() 2025-05-07T20:32:44.3172870Z 2025-05-07T20:32:44.3172959Z if scale_ub is not None: 2025-05-07T20:32:44.3173071Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3173205Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3173280Z ) 2025-05-07T20:32:44.3173360Z else: 2025-05-07T20:32:44.3173455Z scale_ub_tensor = None 2025-05-07T20:32:44.3173529Z 2025-05-07T20:32:44.3173660Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3173750Z op = silu_mul_quant 2025-05-07T20:32:44.3173834Z if compiled: 2025-05-07T20:32:44.3173937Z op = torch.compile(op) 2025-05-07T20:32:44.3174041Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3174122Z 2025-05-07T20:32:44.3174212Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3174329Z 2025-05-07T20:32:44.3174427Z moe/activation_test.py:117: 2025-05-07T20:32:44.3174556Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3174737Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3174836Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3175205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3175296Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3175787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3175888Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3176243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3176466Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3176806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3176902Z kernel = self.compile( 2025-05-07T20:32:44.3177280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3177461Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3177586Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3177591Z 2025-05-07T20:32:44.3177793Z self = 2025-05-07T20:32:44.3178569Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3179073Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea2f7880>} 2025-05-07T20:32:44.3179816Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3180013Z context = 2025-05-07T20:32:44.3180018Z 2025-05-07T20:32:44.3180184Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3180445Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3180550Z module_map=module_map) 2025-05-07T20:32:44.3180715Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3180814Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3180900Z E ^ 2025-05-07T20:32:44.3181256Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3181261Z 2025-05-07T20:32:44.3181670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3181678Z 2025-05-07T20:32:44.3181785Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3182008Z self=, 2025-05-07T20:32:44.3182094Z T=2048, 2025-05-07T20:32:44.3182173Z D=7168, 2025-05-07T20:32:44.3182257Z scale_ub=1200.0, 2025-05-07T20:32:44.3182346Z contiguous=False, 2025-05-07T20:32:44.3182432Z compiled=False, 2025-05-07T20:32:44.3182507Z ) 2025-05-07T20:32:44.3182727Z self = 2025-05-07T20:32:44.3182901Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.3182905Z 2025-05-07T20:32:44.3183067Z @given( 2025-05-07T20:32:44.3183192Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3183292Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3183410Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3183603Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3183716Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3183794Z ) 2025-05-07T20:32:44.3184035Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3184127Z def test_silu_mul_quant( 2025-05-07T20:32:44.3184208Z self, 2025-05-07T20:32:44.3184286Z T: int, 2025-05-07T20:32:44.3184365Z D: int, 2025-05-07T20:32:44.3184468Z scale_ub: Optional[float], 2025-05-07T20:32:44.3184557Z contiguous: bool, 2025-05-07T20:32:44.3184644Z compiled: bool, 2025-05-07T20:32:44.3184725Z ) -> None: 2025-05-07T20:32:44.3184820Z torch.manual_seed(2025) 2025-05-07T20:32:44.3184906Z 2025-05-07T20:32:44.3185073Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3185149Z 2025-05-07T20:32:44.3185242Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3185371Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3185462Z x = x_sign * x_clamp 2025-05-07T20:32:44.3185546Z x0 = x[:, :D] 2025-05-07T20:32:44.3185629Z x1 = x[:, D:] 2025-05-07T20:32:44.3185703Z 2025-05-07T20:32:44.3185791Z if contiguous: 2025-05-07T20:32:44.3185882Z x0 = x0.contiguous() 2025-05-07T20:32:44.3185972Z x1 = x1.contiguous() 2025-05-07T20:32:44.3186049Z 2025-05-07T20:32:44.3186140Z if scale_ub is not None: 2025-05-07T20:32:44.3186246Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3186381Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3186459Z ) 2025-05-07T20:32:44.3186544Z else: 2025-05-07T20:32:44.3186640Z scale_ub_tensor = None 2025-05-07T20:32:44.3186714Z 2025-05-07T20:32:44.3186845Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3186944Z op = silu_mul_quant 2025-05-07T20:32:44.3187030Z if compiled: 2025-05-07T20:32:44.3187133Z op = torch.compile(op) 2025-05-07T20:32:44.3187237Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3187311Z 2025-05-07T20:32:44.3187406Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3187410Z 2025-05-07T20:32:44.3187507Z moe/activation_test.py:117: 2025-05-07T20:32:44.3187637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3187740Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3187840Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3188342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.3188442Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3188796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3189025Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3189361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3189461Z kernel = self.compile( 2025-05-07T20:32:44.3189837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3190008Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3190136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3190141Z 2025-05-07T20:32:44.3190344Z self = 2025-05-07T20:32:44.3191210Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3191782Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5beab019e0>} 2025-05-07T20:32:44.3192521Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3192715Z context = 2025-05-07T20:32:44.3192719Z 2025-05-07T20:32:44.3192880Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3193152Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3193259Z module_map=module_map) 2025-05-07T20:32:44.3193419Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3193527Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3193605Z E ^ 2025-05-07T20:32:44.3193956Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3193963Z 2025-05-07T20:32:44.3194371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3194376Z 2025-05-07T20:32:44.3194477Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3194705Z self=, 2025-05-07T20:32:44.3194781Z T=1, 2025-05-07T20:32:44.3194857Z D=7168, 2025-05-07T20:32:44.3194943Z scale_ub=None, 2025-05-07T20:32:44.3195029Z contiguous=True, 2025-05-07T20:32:44.3195120Z compiled=False, 2025-05-07T20:32:44.3195195Z ) 2025-05-07T20:32:44.3195411Z self = 2025-05-07T20:32:44.3195580Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3195584Z 2025-05-07T20:32:44.3195663Z @given( 2025-05-07T20:32:44.3195838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3195943Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3196056Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3196173Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3196290Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3196362Z ) 2025-05-07T20:32:44.3196604Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3196699Z def test_silu_mul_quant( 2025-05-07T20:32:44.3196775Z self, 2025-05-07T20:32:44.3196860Z T: int, 2025-05-07T20:32:44.3196937Z D: int, 2025-05-07T20:32:44.3197035Z scale_ub: Optional[float], 2025-05-07T20:32:44.3197125Z contiguous: bool, 2025-05-07T20:32:44.3197210Z compiled: bool, 2025-05-07T20:32:44.3197293Z ) -> None: 2025-05-07T20:32:44.3197393Z torch.manual_seed(2025) 2025-05-07T20:32:44.3197466Z 2025-05-07T20:32:44.3197632Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3197708Z 2025-05-07T20:32:44.3197798Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3197922Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3198013Z x = x_sign * x_clamp 2025-05-07T20:32:44.3198095Z x0 = x[:, :D] 2025-05-07T20:32:44.3198180Z x1 = x[:, D:] 2025-05-07T20:32:44.3198252Z 2025-05-07T20:32:44.3198333Z if contiguous: 2025-05-07T20:32:44.3198426Z x0 = x0.contiguous() 2025-05-07T20:32:44.3198513Z x1 = x1.contiguous() 2025-05-07T20:32:44.3198674Z 2025-05-07T20:32:44.3198768Z if scale_ub is not None: 2025-05-07T20:32:44.3198872Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3199004Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3199188Z ) 2025-05-07T20:32:44.3199263Z else: 2025-05-07T20:32:44.3199355Z scale_ub_tensor = None 2025-05-07T20:32:44.3199429Z 2025-05-07T20:32:44.3199556Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3199648Z op = silu_mul_quant 2025-05-07T20:32:44.3199731Z if compiled: 2025-05-07T20:32:44.3199829Z op = torch.compile(op) 2025-05-07T20:32:44.3199935Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3200007Z 2025-05-07T20:32:44.3200096Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3200100Z 2025-05-07T20:32:44.3200199Z moe/activation_test.py:117: 2025-05-07T20:32:44.3200331Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3200430Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3200530Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3201027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3201125Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3201479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3201697Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3202032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3202123Z kernel = self.compile( 2025-05-07T20:32:44.3202500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3202680Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3202805Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3202814Z 2025-05-07T20:32:44.3203019Z self = 2025-05-07T20:32:44.3203790Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3204292Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5beab5e020>} 2025-05-07T20:32:44.3205036Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3205222Z context = 2025-05-07T20:32:44.3205226Z 2025-05-07T20:32:44.3205390Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3205652Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3205766Z module_map=module_map) 2025-05-07T20:32:44.3205926Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3206025Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3206106Z E ^ 2025-05-07T20:32:44.3206460Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3206464Z 2025-05-07T20:32:44.3206871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3206879Z 2025-05-07T20:32:44.3207060Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3207283Z self=, 2025-05-07T20:32:44.3207364Z T=16384, 2025-05-07T20:32:44.3207574Z D=7168, 2025-05-07T20:32:44.3207658Z scale_ub=1200.0, 2025-05-07T20:32:44.3207747Z contiguous=False, 2025-05-07T20:32:44.3207830Z compiled=True, 2025-05-07T20:32:44.3207905Z ) 2025-05-07T20:32:44.3208125Z self = 2025-05-07T20:32:44.3208302Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.3208306Z 2025-05-07T20:32:44.3208383Z @given( 2025-05-07T20:32:44.3208505Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3208605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3208722Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3208837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3208955Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3209038Z ) 2025-05-07T20:32:44.3209279Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3209378Z def test_silu_mul_quant( 2025-05-07T20:32:44.3209457Z self, 2025-05-07T20:32:44.3209533Z T: int, 2025-05-07T20:32:44.3209610Z D: int, 2025-05-07T20:32:44.3209709Z scale_ub: Optional[float], 2025-05-07T20:32:44.3209798Z contiguous: bool, 2025-05-07T20:32:44.3209885Z compiled: bool, 2025-05-07T20:32:44.3209964Z ) -> None: 2025-05-07T20:32:44.3210061Z torch.manual_seed(2025) 2025-05-07T20:32:44.3210135Z 2025-05-07T20:32:44.3210300Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3210376Z 2025-05-07T20:32:44.3210471Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3210593Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3210685Z x = x_sign * x_clamp 2025-05-07T20:32:44.3210768Z x0 = x[:, :D] 2025-05-07T20:32:44.3210847Z x1 = x[:, D:] 2025-05-07T20:32:44.3210920Z 2025-05-07T20:32:44.3211005Z if contiguous: 2025-05-07T20:32:44.3211101Z x0 = x0.contiguous() 2025-05-07T20:32:44.3211187Z x1 = x1.contiguous() 2025-05-07T20:32:44.3211261Z 2025-05-07T20:32:44.3211349Z if scale_ub is not None: 2025-05-07T20:32:44.3211455Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3211586Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3211661Z ) 2025-05-07T20:32:44.3211742Z else: 2025-05-07T20:32:44.3211833Z scale_ub_tensor = None 2025-05-07T20:32:44.3211905Z 2025-05-07T20:32:44.3212033Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3212123Z op = silu_mul_quant 2025-05-07T20:32:44.3212207Z if compiled: 2025-05-07T20:32:44.3212310Z op = torch.compile(op) 2025-05-07T20:32:44.3212413Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3212485Z 2025-05-07T20:32:44.3212576Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3212584Z 2025-05-07T20:32:44.3212678Z moe/activation_test.py:117: 2025-05-07T20:32:44.3212807Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3212905Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3213000Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3213367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3213457Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3213944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3214043Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3214477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3214701Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3215109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3215201Z kernel = self.compile( 2025-05-07T20:32:44.3215581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3215753Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3215882Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3215886Z 2025-05-07T20:32:44.3216086Z self = 2025-05-07T20:32:44.3216863Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3217362Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c05131580>} 2025-05-07T20:32:44.3218108Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3218298Z context = 2025-05-07T20:32:44.3218303Z 2025-05-07T20:32:44.3218463Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3218724Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3218831Z module_map=module_map) 2025-05-07T20:32:44.3218994Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3219097Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3219175Z E ^ 2025-05-07T20:32:44.3219526Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3219537Z 2025-05-07T20:32:44.3219948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3219952Z 2025-05-07T20:32:44.3220055Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3220283Z self=, 2025-05-07T20:32:44.3220361Z T=1, 2025-05-07T20:32:44.3220437Z D=7168, 2025-05-07T20:32:44.3220522Z scale_ub=None, 2025-05-07T20:32:44.3220609Z contiguous=False, 2025-05-07T20:32:44.3220693Z compiled=False, 2025-05-07T20:32:44.3220772Z ) 2025-05-07T20:32:44.3220991Z self = 2025-05-07T20:32:44.3221155Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.3221159Z 2025-05-07T20:32:44.3221243Z @given( 2025-05-07T20:32:44.3221366Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3221463Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3221580Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3221695Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3221809Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3221883Z ) 2025-05-07T20:32:44.3222123Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3222218Z def test_silu_mul_quant( 2025-05-07T20:32:44.3222294Z self, 2025-05-07T20:32:44.3222371Z T: int, 2025-05-07T20:32:44.3222451Z D: int, 2025-05-07T20:32:44.3222548Z scale_ub: Optional[float], 2025-05-07T20:32:44.3222722Z contiguous: bool, 2025-05-07T20:32:44.3222812Z compiled: bool, 2025-05-07T20:32:44.3222888Z ) -> None: 2025-05-07T20:32:44.3222982Z torch.manual_seed(2025) 2025-05-07T20:32:44.3223129Z 2025-05-07T20:32:44.3223292Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3223367Z 2025-05-07T20:32:44.3223456Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3223579Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3223668Z x = x_sign * x_clamp 2025-05-07T20:32:44.3223747Z x0 = x[:, :D] 2025-05-07T20:32:44.3223825Z x1 = x[:, D:] 2025-05-07T20:32:44.3223901Z 2025-05-07T20:32:44.3223981Z if contiguous: 2025-05-07T20:32:44.3224071Z x0 = x0.contiguous() 2025-05-07T20:32:44.3224161Z x1 = x1.contiguous() 2025-05-07T20:32:44.3224233Z 2025-05-07T20:32:44.3224323Z if scale_ub is not None: 2025-05-07T20:32:44.3224434Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3224568Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3224649Z ) 2025-05-07T20:32:44.3224724Z else: 2025-05-07T20:32:44.3224822Z scale_ub_tensor = None 2025-05-07T20:32:44.3224899Z 2025-05-07T20:32:44.3225025Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3225113Z op = silu_mul_quant 2025-05-07T20:32:44.3225200Z if compiled: 2025-05-07T20:32:44.3225296Z op = torch.compile(op) 2025-05-07T20:32:44.3225399Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3225473Z 2025-05-07T20:32:44.3225562Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3225566Z 2025-05-07T20:32:44.3225660Z moe/activation_test.py:117: 2025-05-07T20:32:44.3225791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3225889Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3225996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3226487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3226586Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3226943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3227160Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3227500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3227592Z kernel = self.compile( 2025-05-07T20:32:44.3227968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3228141Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3228268Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3228273Z 2025-05-07T20:32:44.3228472Z self = 2025-05-07T20:32:44.3229251Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3229746Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c05132840>} 2025-05-07T20:32:44.3230489Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3230677Z context = 2025-05-07T20:32:44.3230763Z 2025-05-07T20:32:44.3230931Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3231191Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3231393Z module_map=module_map) 2025-05-07T20:32:44.3231556Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3231654Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3231731Z E ^ 2025-05-07T20:32:44.3232084Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3232089Z 2025-05-07T20:32:44.3232497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3232502Z 2025-05-07T20:32:44.3232606Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3232834Z self=, 2025-05-07T20:32:44.3232911Z T=2048, 2025-05-07T20:32:44.3232992Z D=7168, 2025-05-07T20:32:44.3233073Z scale_ub=None, 2025-05-07T20:32:44.3233158Z contiguous=False, 2025-05-07T20:32:44.3233249Z compiled=True, 2025-05-07T20:32:44.3233323Z ) 2025-05-07T20:32:44.3233540Z self = 2025-05-07T20:32:44.3233717Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.3233722Z 2025-05-07T20:32:44.3233799Z @given( 2025-05-07T20:32:44.3233922Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3234018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3234130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3234247Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3234359Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3234432Z ) 2025-05-07T20:32:44.3234681Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3234772Z def test_silu_mul_quant( 2025-05-07T20:32:44.3234849Z self, 2025-05-07T20:32:44.3234933Z T: int, 2025-05-07T20:32:44.3235008Z D: int, 2025-05-07T20:32:44.3235107Z scale_ub: Optional[float], 2025-05-07T20:32:44.3235194Z contiguous: bool, 2025-05-07T20:32:44.3235277Z compiled: bool, 2025-05-07T20:32:44.3235357Z ) -> None: 2025-05-07T20:32:44.3235448Z torch.manual_seed(2025) 2025-05-07T20:32:44.3235520Z 2025-05-07T20:32:44.3235687Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3235810Z 2025-05-07T20:32:44.3235900Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3236026Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3236112Z x = x_sign * x_clamp 2025-05-07T20:32:44.3236192Z x0 = x[:, :D] 2025-05-07T20:32:44.3236273Z x1 = x[:, D:] 2025-05-07T20:32:44.3236349Z 2025-05-07T20:32:44.3236431Z if contiguous: 2025-05-07T20:32:44.3236527Z x0 = x0.contiguous() 2025-05-07T20:32:44.3236614Z x1 = x1.contiguous() 2025-05-07T20:32:44.3236692Z 2025-05-07T20:32:44.3236780Z if scale_ub is not None: 2025-05-07T20:32:44.3236889Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3237023Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3237097Z ) 2025-05-07T20:32:44.3237171Z else: 2025-05-07T20:32:44.3237266Z scale_ub_tensor = None 2025-05-07T20:32:44.3237339Z 2025-05-07T20:32:44.3237464Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3237554Z op = silu_mul_quant 2025-05-07T20:32:44.3237637Z if compiled: 2025-05-07T20:32:44.3237734Z op = torch.compile(op) 2025-05-07T20:32:44.3237841Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3237996Z 2025-05-07T20:32:44.3238091Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3238095Z 2025-05-07T20:32:44.3238190Z moe/activation_test.py:117: 2025-05-07T20:32:44.3238317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3238498Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3238597Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3238962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3239057Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3239544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3239642Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3239996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3240221Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3240559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3240656Z kernel = self.compile( 2025-05-07T20:32:44.3241032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3241206Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3241332Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3241336Z 2025-05-07T20:32:44.3241540Z self = 2025-05-07T20:32:44.3242308Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3242816Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c0ea80ea0>} 2025-05-07T20:32:44.3243564Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3243754Z context = 2025-05-07T20:32:44.3243758Z 2025-05-07T20:32:44.3243924Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3244182Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3244291Z module_map=module_map) 2025-05-07T20:32:44.3244451Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3244550Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3244636Z E ^ 2025-05-07T20:32:44.3244987Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3244996Z 2025-05-07T20:32:44.3245403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3245411Z 2025-05-07T20:32:44.3245511Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3245731Z self=, 2025-05-07T20:32:44.3245811Z T=4096, 2025-05-07T20:32:44.3245888Z D=7168, 2025-05-07T20:32:44.3245970Z scale_ub=None, 2025-05-07T20:32:44.3246061Z contiguous=False, 2025-05-07T20:32:44.3246144Z compiled=True, 2025-05-07T20:32:44.3246217Z ) 2025-05-07T20:32:44.3246435Z self = 2025-05-07T20:32:44.3246691Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.3246696Z 2025-05-07T20:32:44.3246774Z @given( 2025-05-07T20:32:44.3246896Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3246993Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3247188Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3247304Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3247420Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3247498Z ) 2025-05-07T20:32:44.3247740Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3247833Z def test_silu_mul_quant( 2025-05-07T20:32:44.3247912Z self, 2025-05-07T20:32:44.3247990Z T: int, 2025-05-07T20:32:44.3248066Z D: int, 2025-05-07T20:32:44.3248167Z scale_ub: Optional[float], 2025-05-07T20:32:44.3248255Z contiguous: bool, 2025-05-07T20:32:44.3248344Z compiled: bool, 2025-05-07T20:32:44.3248428Z ) -> None: 2025-05-07T20:32:44.3248522Z torch.manual_seed(2025) 2025-05-07T20:32:44.3248598Z 2025-05-07T20:32:44.3248763Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3248844Z 2025-05-07T20:32:44.3248939Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3249062Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3249151Z x = x_sign * x_clamp 2025-05-07T20:32:44.3249233Z x0 = x[:, :D] 2025-05-07T20:32:44.3249312Z x1 = x[:, D:] 2025-05-07T20:32:44.3249386Z 2025-05-07T20:32:44.3249473Z if contiguous: 2025-05-07T20:32:44.3249563Z x0 = x0.contiguous() 2025-05-07T20:32:44.3249651Z x1 = x1.contiguous() 2025-05-07T20:32:44.3249729Z 2025-05-07T20:32:44.3249818Z if scale_ub is not None: 2025-05-07T20:32:44.3249927Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3250060Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3250142Z ) 2025-05-07T20:32:44.3250223Z else: 2025-05-07T20:32:44.3250317Z scale_ub_tensor = None 2025-05-07T20:32:44.3250390Z 2025-05-07T20:32:44.3250520Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3250616Z op = silu_mul_quant 2025-05-07T20:32:44.3250700Z if compiled: 2025-05-07T20:32:44.3250802Z op = torch.compile(op) 2025-05-07T20:32:44.3250907Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3250981Z 2025-05-07T20:32:44.3251074Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3251078Z 2025-05-07T20:32:44.3251175Z moe/activation_test.py:117: 2025-05-07T20:32:44.3251308Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3251408Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3251507Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3251878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3251975Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3252463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3252567Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3256812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3257059Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3257410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3257508Z kernel = self.compile( 2025-05-07T20:32:44.3257891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3258169Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3258303Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3258308Z 2025-05-07T20:32:44.3258516Z self = 2025-05-07T20:32:44.3259372Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3259877Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c05e5b060>} 2025-05-07T20:32:44.3260620Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3260818Z context = 2025-05-07T20:32:44.3260822Z 2025-05-07T20:32:44.3260986Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3261252Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3261364Z module_map=module_map) 2025-05-07T20:32:44.3261526Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3261631Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3261710Z E ^ 2025-05-07T20:32:44.3262063Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3262073Z 2025-05-07T20:32:44.3262483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3262487Z 2025-05-07T20:32:44.3262593Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3262820Z self=, 2025-05-07T20:32:44.3262900Z T=16384, 2025-05-07T20:32:44.3262977Z D=5120, 2025-05-07T20:32:44.3263063Z scale_ub=1200.0, 2025-05-07T20:32:44.3263153Z contiguous=False, 2025-05-07T20:32:44.3263239Z compiled=False, 2025-05-07T20:32:44.3263313Z ) 2025-05-07T20:32:44.3263530Z self = 2025-05-07T20:32:44.3263715Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.3263720Z 2025-05-07T20:32:44.3263796Z @given( 2025-05-07T20:32:44.3263918Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3264017Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3264133Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3264255Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3264367Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3264446Z ) 2025-05-07T20:32:44.3264693Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3264785Z def test_silu_mul_quant( 2025-05-07T20:32:44.3264868Z self, 2025-05-07T20:32:44.3264946Z T: int, 2025-05-07T20:32:44.3265022Z D: int, 2025-05-07T20:32:44.3265124Z scale_ub: Optional[float], 2025-05-07T20:32:44.3265213Z contiguous: bool, 2025-05-07T20:32:44.3265299Z compiled: bool, 2025-05-07T20:32:44.3265636Z ) -> None: 2025-05-07T20:32:44.3265777Z torch.manual_seed(2025) 2025-05-07T20:32:44.3265874Z 2025-05-07T20:32:44.3266047Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3266121Z 2025-05-07T20:32:44.3266211Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3266339Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3266427Z x = x_sign * x_clamp 2025-05-07T20:32:44.3266674Z x0 = x[:, :D] 2025-05-07T20:32:44.3266760Z x1 = x[:, D:] 2025-05-07T20:32:44.3266834Z 2025-05-07T20:32:44.3266925Z if contiguous: 2025-05-07T20:32:44.3267016Z x0 = x0.contiguous() 2025-05-07T20:32:44.3267213Z x1 = x1.contiguous() 2025-05-07T20:32:44.3267288Z 2025-05-07T20:32:44.3267378Z if scale_ub is not None: 2025-05-07T20:32:44.3267487Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3267625Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3267700Z ) 2025-05-07T20:32:44.3267778Z else: 2025-05-07T20:32:44.3267877Z scale_ub_tensor = None 2025-05-07T20:32:44.3267949Z 2025-05-07T20:32:44.3268076Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3268169Z op = silu_mul_quant 2025-05-07T20:32:44.3268253Z if compiled: 2025-05-07T20:32:44.3268353Z op = torch.compile(op) 2025-05-07T20:32:44.3268461Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3268534Z 2025-05-07T20:32:44.3268629Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3268634Z 2025-05-07T20:32:44.3268728Z moe/activation_test.py:117: 2025-05-07T20:32:44.3268861Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3268964Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3269066Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3269563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.3269663Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:44.3270020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:44.3270241Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:44.3270579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:44.3270671Z     kernel = self.compile(
2025-05-07T20:32:44.3271054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:44.3271231Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:44.3271362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:44.3271367Z 
2025-05-07T20:32:44.3271567Z self = 
2025-05-07T20:32:44.3272343Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:44.3272850Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c05f09120>}
2025-05-07T20:32:44.3273593Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:44.3273787Z context = 
2025-05-07T20:32:44.3273792Z 
2025-05-07T20:32:44.3273952Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:44.3274212Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:44.3274321Z                            module_map=module_map)
2025-05-07T20:32:44.3274479Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.3274578Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.3274653Z E       ^
2025-05-07T20:32:44.3275089Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.3275094Z 
2025-05-07T20:32:44.3275509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.3275587Z 
2025-05-07T20:32:44.3275688Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.3275968Z     self=,
2025-05-07T20:32:44.3276046Z     T=16384,
2025-05-07T20:32:44.3276121Z     D=5120,
2025-05-07T20:32:44.3276204Z     scale_ub=1200.0,
2025-05-07T20:32:44.3276288Z     contiguous=True,
2025-05-07T20:32:44.3276369Z     compiled=True,
2025-05-07T20:32:44.3276445Z )
2025-05-07T20:32:44.3276660Z self = 
2025-05-07T20:32:44.3276831Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:44.3276835Z 
2025-05-07T20:32:44.3276912Z     @given(
2025-05-07T20:32:44.3277033Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:44.3277133Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:44.3277245Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:44.3277365Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:44.3277479Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:44.3277552Z     )
2025-05-07T20:32:44.3277792Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:44.3277887Z     def test_silu_mul_quant(
2025-05-07T20:32:44.3277962Z         self,
2025-05-07T20:32:44.3278038Z         T: int,
2025-05-07T20:32:44.3278117Z         D: int,
2025-05-07T20:32:44.3278212Z         scale_ub: Optional[float],
2025-05-07T20:32:44.3278300Z         contiguous: bool,
2025-05-07T20:32:44.3278389Z         compiled: bool,
2025-05-07T20:32:44.3278466Z     ) -> None:
2025-05-07T20:32:44.3278564Z         torch.manual_seed(2025)
2025-05-07T20:32:44.3278636Z 
2025-05-07T20:32:44.3278804Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:44.3278880Z 
2025-05-07T20:32:44.3278971Z         x_sign = torch.sign(x)
2025-05-07T20:32:44.3279094Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:44.3279192Z         x = x_sign * x_clamp
2025-05-07T20:32:44.3279271Z         x0 = x[:, :D]
2025-05-07T20:32:44.3279350Z         x1 = x[:, D:]
2025-05-07T20:32:44.3279424Z 
2025-05-07T20:32:44.3279505Z         if contiguous:
2025-05-07T20:32:44.3279596Z             x0 = x0.contiguous()
2025-05-07T20:32:44.3279684Z             x1 = x1.contiguous()
2025-05-07T20:32:44.3279755Z 
2025-05-07T20:32:44.3279844Z         if scale_ub is not None:
2025-05-07T20:32:44.3279952Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:44.3280085Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:44.3280163Z             )
2025-05-07T20:32:44.3280238Z         else:
2025-05-07T20:32:44.3280336Z             scale_ub_tensor = None
2025-05-07T20:32:44.3280411Z 
2025-05-07T20:32:44.3280538Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:44.3280628Z             op = silu_mul_quant
2025-05-07T20:32:44.3280717Z             if compiled:
2025-05-07T20:32:44.3280814Z                 op = torch.compile(op)
2025-05-07T20:32:44.3280916Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:44.3280991Z 
2025-05-07T20:32:44.3281079Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:44.3281083Z 
2025-05-07T20:32:44.3281182Z moe/activation_test.py:117: 
2025-05-07T20:32:44.3281307Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:44.3281407Z moe/activation_test.py:115: in fn
2025-05-07T20:32:44.3281508Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:44.3281873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:44.3281963Z     return fn(*args, **kwargs)
2025-05-07T20:32:44.3282540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:44.3282638Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:44.3283066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:44.3283284Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:44.3283618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:44.3283715Z     kernel = self.compile(
2025-05-07T20:32:44.3284091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:44.3284260Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:44.3284388Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:44.3284397Z 
2025-05-07T20:32:44.3284600Z self = 
2025-05-07T20:32:44.3285376Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:44.3285881Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c05e960c0>}
2025-05-07T20:32:44.3286624Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:44.3286811Z context = 
2025-05-07T20:32:44.3286815Z 
2025-05-07T20:32:44.3286979Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:44.3287244Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:44.3287354Z                            module_map=module_map)
2025-05-07T20:32:44.3287513Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.3287616Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.3287691Z E       ^
2025-05-07T20:32:44.3288047Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.3288051Z 
2025-05-07T20:32:44.3288457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.3288461Z 
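The failure is an architecture mismatch rather than a bug in the kernel logic: fp8e4nv is Triton's name for float8_e4m3fn, whose conversions are only lowered on NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper). The A10G on this linux.g5.4xlarge runner reports compute capability 8.6, where Triton offers only the fp8e4b15 and fp8e5 variants, exactly as the ValueError lists, so every example fails at kernel-compile time before any numerics run.

A minimal standalone reproduction of the call path in the traceback, assuming only that silu_mul_quant is importable from fbgemm_gpu.experimental.gen_ai.moe.activation (the module shown above) and that a CUDA device is present:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Shapes taken from one of the failing Hypothesis examples.
    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0 = x[:, :D].contiguous()
    x1 = x[:, D:].contiguous()
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

    # On a pre-SM-8.9 GPU this raises triton.compiler.errors.CompilationError
    # wrapping the ValueError above; on SM 8.9+ it should return the fp8
    # output tensor and its scale.
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)

And a sketch of a capability gate that could skip these examples on unsupported GPUs; the helper and marker names here are illustrative, not FBGEMM's actual test plumbing:

    import pytest
    import torch

    def fp8e4nv_supported() -> bool:
        # Triton lowers fp8e4nv (float8_e4m3fn) only on SM 8.9+ (Ada/Hopper);
        # the A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker for gating fp8 tests on device capability.
    requires_fp8e4nv = pytest.mark.skipif(
        not fp8e4nv_supported(),
        reason="Triton fp8e4nv requires compute capability >= 8.9",
    )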
Each of the following Hypothesis examples then failed with the identical CompilationError, raised from the same call path (moe/activation_test.py:117 -> fn -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid] -> triton compile):
2025-05-07T20:32:44.3288564Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:44.3301562Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:44.3314391Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:44.3327670Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:44.3340936Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:44.3354042Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:44.3367648Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:44.3384404Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:44.3397035Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:44.3410148Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:44.3423077Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:44.3435665Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3435669Z 2025-05-07T20:32:44.3436135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3436145Z 2025-05-07T20:32:44.3436252Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3436473Z self=, 2025-05-07T20:32:44.3436552Z T=16384, 2025-05-07T20:32:44.3436632Z D=5120, 2025-05-07T20:32:44.3436714Z scale_ub=None, 2025-05-07T20:32:44.3436800Z contiguous=False, 2025-05-07T20:32:44.3436886Z compiled=False, 2025-05-07T20:32:44.3436958Z ) 2025-05-07T20:32:44.3437174Z self = 2025-05-07T20:32:44.3437353Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.3437358Z 2025-05-07T20:32:44.3437434Z @given( 2025-05-07T20:32:44.3437563Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3437660Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3437774Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3437898Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3438011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3438084Z ) 2025-05-07T20:32:44.3438331Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3438424Z def test_silu_mul_quant( 2025-05-07T20:32:44.3438508Z self, 2025-05-07T20:32:44.3438586Z T: int, 2025-05-07T20:32:44.3438663Z D: int, 2025-05-07T20:32:44.3438765Z scale_ub: Optional[float], 2025-05-07T20:32:44.3438852Z contiguous: bool, 2025-05-07T20:32:44.3438937Z compiled: bool, 2025-05-07T20:32:44.3439019Z ) -> None: 2025-05-07T20:32:44.3439114Z torch.manual_seed(2025) 2025-05-07T20:32:44.3439187Z 2025-05-07T20:32:44.3439835Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3439912Z 2025-05-07T20:32:44.3440003Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3440205Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3442019Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
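[Editor's note on the repeated CompilationError above: Triton's fp8e4nv dtype (PyTorch's torch.float8_e4m3fn) is rejected here because the linux.g5.4xlarge runner's A10G GPU reports compute capability (8, 6), and Triton only lowers fp8e4nv on sm_89+ parts; older GPUs are limited to 'fp8e4b15' and 'fp8e5', exactly as the ValueError says. A minimal sketch of a capability guard a test could use to skip these cases — the helper name is hypothetical and not part of FBGEMM or Triton:]

```python
# Hypothetical guard (not FBGEMM's actual code): check whether Triton can
# lower fp8e4nv (torch.float8_e4m3fn) on the current GPU before invoking
# fp8 kernels such as _fbgemm_silu_mul_quant.
import torch


def supports_fp8e4nv() -> bool:
    """True iff the active CUDA device can compile fp8e4nv Triton kernels.

    Assumption: fp8e4nv lowering needs compute capability >= (8, 9);
    the A10G on this runner reports (8, 6), which is why the log shows
    "type fp8e4nv not supported in this architecture".
    """
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)
```

[Under that assumption, the test could be decorated with `@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs sm_89+")` so pre-Ada runners skip instead of failing every sampled example.]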
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3442025Z 2025-05-07T20:32:44.3442146Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.3442150Z 2025-05-07T20:32:44.3442259Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3442489Z self=, 2025-05-07T20:32:44.3442566Z T=4096, 2025-05-07T20:32:44.3442647Z D=7168, 2025-05-07T20:32:44.3442735Z scale_ub=1200.0, 2025-05-07T20:32:44.3442818Z contiguous=True, 2025-05-07T20:32:44.3442901Z compiled=True, 2025-05-07T20:32:44.3442976Z ) 2025-05-07T20:32:44.3443189Z self = 2025-05-07T20:32:44.3443359Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.3443363Z 2025-05-07T20:32:44.3443444Z @given( 2025-05-07T20:32:44.3443561Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3443660Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3443776Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3443891Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3444009Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3444083Z ) 2025-05-07T20:32:44.3444323Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3444424Z def test_silu_mul_quant( 2025-05-07T20:32:44.3444500Z self, 2025-05-07T20:32:44.3444576Z T: int, 2025-05-07T20:32:44.3444655Z D: int, 2025-05-07T20:32:44.3444752Z scale_ub: Optional[float], 2025-05-07T20:32:44.3444840Z contiguous: bool, 2025-05-07T20:32:44.3444929Z compiled: bool, 2025-05-07T20:32:44.3445006Z ) -> None: 2025-05-07T20:32:44.3445105Z torch.manual_seed(2025) 2025-05-07T20:32:44.3445178Z 2025-05-07T20:32:44.3445342Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3445420Z 2025-05-07T20:32:44.3445511Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3445634Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3447437Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3447447Z 2025-05-07T20:32:44.3447565Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.3447569Z 2025-05-07T20:32:44.3447673Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3447893Z self=, 2025-05-07T20:32:44.3447972Z T=16384, 2025-05-07T20:32:44.3448050Z D=7168, 2025-05-07T20:32:44.3448214Z scale_ub=None, 2025-05-07T20:32:44.3448305Z contiguous=False, 2025-05-07T20:32:44.3448388Z compiled=False, 2025-05-07T20:32:44.3448466Z ) 2025-05-07T20:32:44.3448683Z self = 2025-05-07T20:32:44.3448956Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.3448960Z 2025-05-07T20:32:44.3449038Z @given( 2025-05-07T20:32:44.3449164Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3449262Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3449374Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3449492Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3449602Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3449684Z ) 2025-05-07T20:32:44.3449926Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3450018Z def test_silu_mul_quant( 2025-05-07T20:32:44.3450103Z self, 2025-05-07T20:32:44.3450180Z T: int, 2025-05-07T20:32:44.3450257Z D: int, 2025-05-07T20:32:44.3450358Z scale_ub: Optional[float], 2025-05-07T20:32:44.3450452Z contiguous: bool, 2025-05-07T20:32:44.3450536Z compiled: bool, 2025-05-07T20:32:44.3450616Z ) -> None: 2025-05-07T20:32:44.3450710Z torch.manual_seed(2025) 2025-05-07T20:32:44.3450783Z 2025-05-07T20:32:44.3450950Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3452752Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3452763Z 2025-05-07T20:32:44.3452877Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3452885Z 2025-05-07T20:32:44.3452986Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3453211Z self=, 2025-05-07T20:32:44.3453290Z T=2048, 2025-05-07T20:32:44.3453365Z D=7168, 2025-05-07T20:32:44.3453450Z scale_ub=1200.0, 2025-05-07T20:32:44.3453534Z contiguous=True, 2025-05-07T20:32:44.3453619Z compiled=True, 2025-05-07T20:32:44.3453694Z ) 2025-05-07T20:32:44.3453909Z self = 2025-05-07T20:32:44.3454078Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.3454086Z 2025-05-07T20:32:44.3454162Z @given( 2025-05-07T20:32:44.3454284Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3454384Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3454496Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3454616Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3454730Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3454802Z ) 2025-05-07T20:32:44.3455042Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3455139Z def test_silu_mul_quant( 2025-05-07T20:32:44.3455214Z self, 2025-05-07T20:32:44.3455295Z T: int, 2025-05-07T20:32:44.3455370Z D: int, 2025-05-07T20:32:44.3455465Z scale_ub: Optional[float], 2025-05-07T20:32:44.3455553Z contiguous: bool, 2025-05-07T20:32:44.3455636Z compiled: bool, 2025-05-07T20:32:44.3455711Z ) -> None: 2025-05-07T20:32:44.3455806Z torch.manual_seed(2025) 2025-05-07T20:32:44.3455878Z 2025-05-07T20:32:44.3456123Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3456198Z 2025-05-07T20:32:44.3456288Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3456414Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3458268Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3458274Z 2025-05-07T20:32:44.3458387Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.3458392Z 2025-05-07T20:32:44.3458501Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3458719Z self=, 2025-05-07T20:32:44.3458798Z T=2048, 2025-05-07T20:32:44.3458882Z D=7168, 2025-05-07T20:32:44.3458963Z scale_ub=None, 2025-05-07T20:32:44.3459049Z contiguous=True, 2025-05-07T20:32:44.3459130Z compiled=False, 2025-05-07T20:32:44.3459201Z ) 2025-05-07T20:32:44.3459417Z self = 2025-05-07T20:32:44.3459585Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3459590Z 2025-05-07T20:32:44.3459666Z @given( 2025-05-07T20:32:44.3459787Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3459882Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3459996Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3460109Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3460224Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3460302Z ) 2025-05-07T20:32:44.3460539Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3460636Z def test_silu_mul_quant( 2025-05-07T20:32:44.3460712Z self, 2025-05-07T20:32:44.3460786Z T: int, 2025-05-07T20:32:44.3460860Z D: int, 2025-05-07T20:32:44.3460957Z scale_ub: Optional[float], 2025-05-07T20:32:44.3461043Z contiguous: bool, 2025-05-07T20:32:44.3461126Z compiled: bool, 2025-05-07T20:32:44.3461204Z ) -> None: 2025-05-07T20:32:44.3461297Z torch.manual_seed(2025) 2025-05-07T20:32:44.3461375Z 2025-05-07T20:32:44.3461538Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3461610Z 2025-05-07T20:32:44.3461703Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.3463489Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
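[Editor's note: the OutOfMemoryError cases above look like a secondary effect — each failing example leaves its CUDA allocations cached, so free memory shrinks from 144.44 MiB to 32.44 MiB and eventually even a 40 MiB request fails. A hedged sketch of the mitigation the error message itself suggests, plus an explicit cache flush between examples; the helper is illustrative, not from activation_test.py:]

```python
# Illustrative only: reduce allocator fragmentation and return cached blocks
# between property-based examples. The env var must be set before the first
# CUDA allocation in the process for it to take effect.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import gc
import torch


def release_cuda_memory() -> None:
    # Drop dangling Python references first, then hand cached blocks back to
    # the driver so the next example starts from a clean allocator state.
    gc.collect()
    torch.cuda.empty_cache()
```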
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3463500Z 2025-05-07T20:32:44.3463621Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.3463626Z 2025-05-07T20:32:44.3463725Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3463943Z self=, 2025-05-07T20:32:44.3464025Z T=1, 2025-05-07T20:32:44.3464099Z D=7168, 2025-05-07T20:32:44.3464178Z scale_ub=1200.0, 2025-05-07T20:32:44.3464264Z contiguous=True, 2025-05-07T20:32:44.3464345Z compiled=False, 2025-05-07T20:32:44.3464508Z ) 2025-05-07T20:32:44.3464725Z self = 2025-05-07T20:32:44.3464885Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3464963Z 2025-05-07T20:32:44.3465040Z @given( 2025-05-07T20:32:44.3465155Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3465251Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3465719Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3465892Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3466048Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3466163Z ) 2025-05-07T20:32:44.3466424Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3466519Z def test_silu_mul_quant( 2025-05-07T20:32:44.3466595Z self, 2025-05-07T20:32:44.3466670Z T: int, 2025-05-07T20:32:44.3466749Z D: int, 2025-05-07T20:32:44.3466852Z scale_ub: Optional[float], 2025-05-07T20:32:44.3466939Z contiguous: bool, 2025-05-07T20:32:44.3467028Z compiled: bool, 2025-05-07T20:32:44.3467105Z ) -> None: 2025-05-07T20:32:44.3467203Z torch.manual_seed(2025) 2025-05-07T20:32:44.3467280Z 2025-05-07T20:32:44.3467444Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3467517Z 2025-05-07T20:32:44.3467615Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3467736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3467828Z x = x_sign * x_clamp 2025-05-07T20:32:44.3467907Z x0 = x[:, :D] 2025-05-07T20:32:44.3467984Z x1 = x[:, D:] 2025-05-07T20:32:44.3468059Z 2025-05-07T20:32:44.3468140Z if contiguous: 2025-05-07T20:32:44.3468229Z x0 = x0.contiguous() 2025-05-07T20:32:44.3468319Z x1 = x1.contiguous() 2025-05-07T20:32:44.3468390Z 2025-05-07T20:32:44.3468484Z if scale_ub is not None: 2025-05-07T20:32:44.3468593Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3468724Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3468803Z ) 2025-05-07T20:32:44.3468880Z else: 2025-05-07T20:32:44.3468972Z scale_ub_tensor = None 2025-05-07T20:32:44.3469043Z 2025-05-07T20:32:44.3469171Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3469260Z op = silu_mul_quant 2025-05-07T20:32:44.3469345Z if compiled: 2025-05-07T20:32:44.3469443Z op = torch.compile(op) 2025-05-07T20:32:44.3469546Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3469621Z 2025-05-07T20:32:44.3469709Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3469714Z 2025-05-07T20:32:44.3469810Z moe/activation_test.py:117: 2025-05-07T20:32:44.3469939Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3470042Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3470143Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3470645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3470745Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3471104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3471322Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3471658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3471754Z kernel = self.compile( 2025-05-07T20:32:44.3472133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3472545Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3472675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3472680Z 2025-05-07T20:32:44.3472881Z self = 2025-05-07T20:32:44.3473767Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3474267Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b260f40>} 2025-05-07T20:32:44.3475011Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3475203Z context = 2025-05-07T20:32:44.3475208Z 2025-05-07T20:32:44.3475370Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3475639Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3475804Z module_map=module_map) 2025-05-07T20:32:44.3475966Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3476063Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3476140Z E ^ 2025-05-07T20:32:44.3476497Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3476501Z 2025-05-07T20:32:44.3476911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3476916Z 2025-05-07T20:32:44.3477019Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3477244Z self=, 2025-05-07T20:32:44.3477320Z T=128, 2025-05-07T20:32:44.3477398Z D=5120, 2025-05-07T20:32:44.3477482Z scale_ub=None, 2025-05-07T20:32:44.3477564Z contiguous=True, 2025-05-07T20:32:44.3477651Z compiled=False, 2025-05-07T20:32:44.3477724Z ) 2025-05-07T20:32:44.3477940Z self = 2025-05-07T20:32:44.3478110Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3478114Z 2025-05-07T20:32:44.3478190Z @given( 2025-05-07T20:32:44.3478309Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3478407Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3478520Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3478637Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3478753Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3478826Z ) 2025-05-07T20:32:44.3479068Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3479165Z def test_silu_mul_quant( 2025-05-07T20:32:44.3479244Z self, 2025-05-07T20:32:44.3479327Z T: int, 2025-05-07T20:32:44.3479401Z D: int, 2025-05-07T20:32:44.3479499Z scale_ub: Optional[float], 2025-05-07T20:32:44.3479590Z contiguous: bool, 2025-05-07T20:32:44.3479674Z compiled: bool, 2025-05-07T20:32:44.3479750Z ) -> None: 2025-05-07T20:32:44.3479845Z torch.manual_seed(2025) 2025-05-07T20:32:44.3479917Z 2025-05-07T20:32:44.3480087Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3480159Z 2025-05-07T20:32:44.3480249Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3480373Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3480459Z x = x_sign * x_clamp 2025-05-07T20:32:44.3480621Z x0 = x[:, :D] 2025-05-07T20:32:44.3480706Z x1 = x[:, D:] 2025-05-07T20:32:44.3480778Z 2025-05-07T20:32:44.3480861Z if contiguous: 2025-05-07T20:32:44.3480954Z x0 = x0.contiguous() 2025-05-07T20:32:44.3481146Z x1 = x1.contiguous() 2025-05-07T20:32:44.3481217Z 2025-05-07T20:32:44.3481308Z if scale_ub is not None: 2025-05-07T20:32:44.3481412Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3481547Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3481621Z ) 2025-05-07T20:32:44.3481696Z else: 2025-05-07T20:32:44.3481792Z scale_ub_tensor = None 2025-05-07T20:32:44.3481863Z 2025-05-07T20:32:44.3481989Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3482082Z op = silu_mul_quant 2025-05-07T20:32:44.3482164Z if compiled: 2025-05-07T20:32:44.3482263Z op = torch.compile(op) 2025-05-07T20:32:44.3482374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3482445Z 2025-05-07T20:32:44.3482534Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3482538Z 2025-05-07T20:32:44.3482639Z moe/activation_test.py:117: 2025-05-07T20:32:44.3482772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3482871Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3482968Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3483461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3483559Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3483913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3484130Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3484469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3484562Z kernel = self.compile( 2025-05-07T20:32:44.3484940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3485117Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3485242Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3485247Z 2025-05-07T20:32:44.3485450Z self = 2025-05-07T20:32:44.3486221Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3486725Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b262020>} 2025-05-07T20:32:44.3487463Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3487655Z context = 2025-05-07T20:32:44.3487662Z 2025-05-07T20:32:44.3487825Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3488086Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3488196Z module_map=module_map) 2025-05-07T20:32:44.3488358Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3488458Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3488537Z E ^ 2025-05-07T20:32:44.3488971Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3488976Z 2025-05-07T20:32:44.3489391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3489468Z 2025-05-07T20:32:44.3489569Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3489790Z self=, 2025-05-07T20:32:44.3489868Z T=128, 2025-05-07T20:32:44.3489943Z D=7168, 2025-05-07T20:32:44.3490022Z scale_ub=None, 2025-05-07T20:32:44.3490109Z contiguous=True, 2025-05-07T20:32:44.3490190Z compiled=False, 2025-05-07T20:32:44.3490261Z ) 2025-05-07T20:32:44.3490480Z self = 2025-05-07T20:32:44.3490647Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3490651Z 2025-05-07T20:32:44.3490734Z @given( 2025-05-07T20:32:44.3490857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3490957Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3491074Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3491192Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3491304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3491380Z ) 2025-05-07T20:32:44.3491619Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3491712Z def test_silu_mul_quant( 2025-05-07T20:32:44.3491787Z self, 2025-05-07T20:32:44.3491862Z T: int, 2025-05-07T20:32:44.3491940Z D: int, 2025-05-07T20:32:44.3492037Z scale_ub: Optional[float], 2025-05-07T20:32:44.3492124Z contiguous: bool, 2025-05-07T20:32:44.3492212Z compiled: bool, 2025-05-07T20:32:44.3492289Z ) -> None: 2025-05-07T20:32:44.3492385Z torch.manual_seed(2025) 2025-05-07T20:32:44.3492459Z 2025-05-07T20:32:44.3492629Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3492702Z 2025-05-07T20:32:44.3492797Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3492919Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3493011Z x = x_sign * x_clamp 2025-05-07T20:32:44.3493093Z x0 = x[:, :D] 2025-05-07T20:32:44.3493172Z x1 = x[:, D:] 2025-05-07T20:32:44.3493246Z 2025-05-07T20:32:44.3493329Z if contiguous: 2025-05-07T20:32:44.3493418Z x0 = x0.contiguous() 2025-05-07T20:32:44.3493508Z x1 = x1.contiguous() 2025-05-07T20:32:44.3493579Z 2025-05-07T20:32:44.3493668Z if scale_ub is not None: 2025-05-07T20:32:44.3493775Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3493905Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3493977Z ) 2025-05-07T20:32:44.3494054Z else: 2025-05-07T20:32:44.3494150Z scale_ub_tensor = None 2025-05-07T20:32:44.3494223Z 2025-05-07T20:32:44.3494353Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3494442Z op = silu_mul_quant 2025-05-07T20:32:44.3494534Z if compiled: 2025-05-07T20:32:44.3494630Z op = torch.compile(op) 2025-05-07T20:32:44.3494733Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3494808Z 2025-05-07T20:32:44.3494898Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3494902Z 2025-05-07T20:32:44.3494998Z moe/activation_test.py:117: 2025-05-07T20:32:44.3495128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3495225Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3495322Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3495820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3495998Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3496358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3496581Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3496996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3497092Z kernel = self.compile( 2025-05-07T20:32:44.3497469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3497642Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3497767Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3497771Z 2025-05-07T20:32:44.3497973Z self = 2025-05-07T20:32:44.3498752Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3499256Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b262f20>} 2025-05-07T20:32:44.3499999Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3500187Z context = 2025-05-07T20:32:44.3500191Z 2025-05-07T20:32:44.3500353Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3500618Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3500729Z module_map=module_map) 2025-05-07T20:32:44.3500895Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3500995Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3501077Z E ^ 2025-05-07T20:32:44.3501440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3501444Z 2025-05-07T20:32:44.3501849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3501854Z 2025-05-07T20:32:44.3501958Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3502185Z self=, 2025-05-07T20:32:44.3502264Z T=2048, 2025-05-07T20:32:44.3506228Z D=7168, 2025-05-07T20:32:44.3506329Z scale_ub=1200.0, 2025-05-07T20:32:44.3506419Z contiguous=True, 2025-05-07T20:32:44.3506508Z compiled=False, 2025-05-07T20:32:44.3506590Z ) 2025-05-07T20:32:44.3506817Z self = 2025-05-07T20:32:44.3506992Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3507001Z 2025-05-07T20:32:44.3507079Z @given( 2025-05-07T20:32:44.3507203Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3507301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3507414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3507534Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3507646Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3507721Z ) 2025-05-07T20:32:44.3507969Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3508062Z def test_silu_mul_quant( 2025-05-07T20:32:44.3508143Z self, 2025-05-07T20:32:44.3508224Z T: int, 2025-05-07T20:32:44.3508403Z D: int, 2025-05-07T20:32:44.3508509Z scale_ub: Optional[float], 2025-05-07T20:32:44.3508597Z contiguous: bool, 2025-05-07T20:32:44.3508683Z compiled: bool, 2025-05-07T20:32:44.3508844Z ) -> None: 2025-05-07T20:32:44.3508940Z torch.manual_seed(2025) 2025-05-07T20:32:44.3509015Z 2025-05-07T20:32:44.3509189Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3510988Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
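[Editor's note: both failure modes now repeat for every sampled (T, D, scale_ub, contiguous, compiled) combination without adding information. One way to keep Hypothesis from even attempting shapes that cannot fit on the 22 GiB card, sketched under the assumption that the test keeps its current @given strategies — the memory check is new, not in the original test:]

```python
# Sketch: discard parameter combinations whose input tensor alone cannot fit
# in the currently free device memory. hypothesis.assume() makes Hypothesis
# drop the example instead of recording a failure.
import torch
from hypothesis import assume


def assume_input_fits(T: int, D: int) -> None:
    bytes_needed = T * 2 * D * 2          # [T, 2*D] bfloat16: 2 bytes/element
    free_bytes, _total = torch.cuda.mem_get_info()
    assume(bytes_needed < free_bytes // 4)  # leave headroom for temporaries
```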
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3510994Z 2025-05-07T20:32:44.3511116Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3511120Z 2025-05-07T20:32:44.3511225Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3511454Z self=, 2025-05-07T20:32:44.3511532Z T=1, 2025-05-07T20:32:44.3511612Z D=5120, 2025-05-07T20:32:44.3511696Z scale_ub=1200.0, 2025-05-07T20:32:44.3511780Z contiguous=True, 2025-05-07T20:32:44.3511865Z compiled=False, 2025-05-07T20:32:44.3511939Z ) 2025-05-07T20:32:44.3512155Z self = 2025-05-07T20:32:44.3512325Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3512330Z 2025-05-07T20:32:44.3512407Z @given( 2025-05-07T20:32:44.3512525Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3512631Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3512750Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3512871Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3512982Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3513064Z ) 2025-05-07T20:32:44.3513310Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3513405Z def test_silu_mul_quant( 2025-05-07T20:32:44.3513478Z self, 2025-05-07T20:32:44.3513555Z T: int, 2025-05-07T20:32:44.3513630Z D: int, 2025-05-07T20:32:44.3513727Z scale_ub: Optional[float], 2025-05-07T20:32:44.3513819Z contiguous: bool, 2025-05-07T20:32:44.3513905Z compiled: bool, 2025-05-07T20:32:44.3513985Z ) -> None: 2025-05-07T20:32:44.3514082Z torch.manual_seed(2025) 2025-05-07T20:32:44.3514153Z 2025-05-07T20:32:44.3514323Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3514397Z 2025-05-07T20:32:44.3514495Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3514619Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3514707Z x = x_sign * x_clamp 2025-05-07T20:32:44.3514790Z x0 = x[:, :D] 2025-05-07T20:32:44.3514876Z x1 = x[:, D:] 2025-05-07T20:32:44.3514948Z 2025-05-07T20:32:44.3515032Z if contiguous: 2025-05-07T20:32:44.3515126Z x0 = x0.contiguous() 2025-05-07T20:32:44.3515215Z x1 = x1.contiguous() 2025-05-07T20:32:44.3515286Z 2025-05-07T20:32:44.3515376Z if scale_ub is not None: 2025-05-07T20:32:44.3515481Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3515619Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3515692Z ) 2025-05-07T20:32:44.3515823Z else: 2025-05-07T20:32:44.3515922Z scale_ub_tensor = None 2025-05-07T20:32:44.3515998Z 2025-05-07T20:32:44.3516124Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3516327Z op = silu_mul_quant 2025-05-07T20:32:44.3516414Z if compiled: 2025-05-07T20:32:44.3516513Z op = torch.compile(op) 2025-05-07T20:32:44.3516702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3516775Z 2025-05-07T20:32:44.3516864Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3516874Z 2025-05-07T20:32:44.3516969Z moe/activation_test.py:117: 2025-05-07T20:32:44.3517099Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3517201Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3517299Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3517799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3517898Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3518260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3518481Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3518819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3518918Z kernel = self.compile( 2025-05-07T20:32:44.3519299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3519470Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3519595Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3519599Z 2025-05-07T20:32:44.3519810Z self = 2025-05-07T20:32:44.3520591Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3521094Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b1004a0>} 2025-05-07T20:32:44.3521840Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3522034Z context = 2025-05-07T20:32:44.3522038Z 2025-05-07T20:32:44.3522199Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3522461Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3522571Z module_map=module_map) 2025-05-07T20:32:44.3522736Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3522834Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3522912Z E ^ 2025-05-07T20:32:44.3523268Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3523277Z 2025-05-07T20:32:44.3523687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3523692Z 2025-05-07T20:32:44.3523794Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3524012Z self=, 2025-05-07T20:32:44.3524090Z T=2048, 2025-05-07T20:32:44.3524166Z D=5120, 2025-05-07T20:32:44.3524246Z scale_ub=None, 2025-05-07T20:32:44.3524332Z contiguous=True, 2025-05-07T20:32:44.3524413Z compiled=False, 2025-05-07T20:32:44.3524487Z ) 2025-05-07T20:32:44.3524702Z self = 2025-05-07T20:32:44.3524953Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3524959Z 2025-05-07T20:32:44.3525039Z @given( 2025-05-07T20:32:44.3525157Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3525329Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3525446Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3525562Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3525673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3525750Z ) 2025-05-07T20:32:44.3525991Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3526086Z def test_silu_mul_quant( 2025-05-07T20:32:44.3526160Z self, 2025-05-07T20:32:44.3526236Z T: int, 2025-05-07T20:32:44.3526317Z D: int, 2025-05-07T20:32:44.3526415Z scale_ub: Optional[float], 2025-05-07T20:32:44.3526503Z contiguous: bool, 2025-05-07T20:32:44.3526599Z compiled: bool, 2025-05-07T20:32:44.3526678Z ) -> None: 2025-05-07T20:32:44.3526774Z torch.manual_seed(2025) 2025-05-07T20:32:44.3526850Z 2025-05-07T20:32:44.3527019Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3527092Z 2025-05-07T20:32:44.3527187Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.3528975Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3528981Z 2025-05-07T20:32:44.3529107Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.3529112Z 2025-05-07T20:32:44.3529212Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3529436Z self=, 2025-05-07T20:32:44.3529519Z T=16384, 2025-05-07T20:32:44.3529596Z D=5120, 2025-05-07T20:32:44.3529683Z scale_ub=None, 2025-05-07T20:32:44.3529765Z contiguous=True, 2025-05-07T20:32:44.3529846Z compiled=False, 2025-05-07T20:32:44.3529921Z ) 2025-05-07T20:32:44.3530136Z self = 2025-05-07T20:32:44.3530306Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3530314Z 2025-05-07T20:32:44.3530389Z @given( 2025-05-07T20:32:44.3530506Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3530607Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3530723Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3530837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3530955Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3531032Z ) 2025-05-07T20:32:44.3531270Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3531365Z def test_silu_mul_quant( 2025-05-07T20:32:44.3531442Z self, 2025-05-07T20:32:44.3531518Z T: int, 2025-05-07T20:32:44.3531597Z D: int, 2025-05-07T20:32:44.3531692Z scale_ub: Optional[float], 2025-05-07T20:32:44.3531780Z contiguous: bool, 2025-05-07T20:32:44.3531865Z compiled: bool, 2025-05-07T20:32:44.3531942Z ) -> None: 2025-05-07T20:32:44.3532038Z torch.manual_seed(2025) 2025-05-07T20:32:44.3532109Z 2025-05-07T20:32:44.3532273Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3534142Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3534221Z 2025-05-07T20:32:44.3534343Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3534347Z 2025-05-07T20:32:44.3534450Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3534671Z self=, 2025-05-07T20:32:44.3534748Z T=4096, 2025-05-07T20:32:44.3534826Z D=5120, 2025-05-07T20:32:44.3534907Z scale_ub=None, 2025-05-07T20:32:44.3534992Z contiguous=True, 2025-05-07T20:32:44.3535080Z compiled=False, 2025-05-07T20:32:44.3535152Z ) 2025-05-07T20:32:44.3535369Z self = 2025-05-07T20:32:44.3535537Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3535549Z 2025-05-07T20:32:44.3535626Z @given( 2025-05-07T20:32:44.3535750Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3535846Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3535957Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3536073Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3536186Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3536264Z ) 2025-05-07T20:32:44.3536504Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3536594Z def test_silu_mul_quant( 2025-05-07T20:32:44.3536675Z self, 2025-05-07T20:32:44.3536756Z T: int, 2025-05-07T20:32:44.3536830Z D: int, 2025-05-07T20:32:44.3536933Z scale_ub: Optional[float], 2025-05-07T20:32:44.3537023Z contiguous: bool, 2025-05-07T20:32:44.3537105Z compiled: bool, 2025-05-07T20:32:44.3537191Z ) -> None: 2025-05-07T20:32:44.3537284Z torch.manual_seed(2025) 2025-05-07T20:32:44.3537361Z 2025-05-07T20:32:44.3537527Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3539303Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3539309Z 2025-05-07T20:32:44.3539426Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3539430Z 2025-05-07T20:32:44.3539529Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3539754Z self=, 2025-05-07T20:32:44.3539828Z T=2048, 2025-05-07T20:32:44.3539902Z D=5120, 2025-05-07T20:32:44.3539984Z scale_ub=None, 2025-05-07T20:32:44.3540069Z contiguous=False, 2025-05-07T20:32:44.3540149Z compiled=False, 2025-05-07T20:32:44.3540224Z ) 2025-05-07T20:32:44.3540436Z self = 2025-05-07T20:32:44.3540604Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.3540612Z 2025-05-07T20:32:44.3540687Z @given( 2025-05-07T20:32:44.3540802Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3540985Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3541101Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3541215Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3541402Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3541476Z ) 2025-05-07T20:32:44.3541714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3541810Z def test_silu_mul_quant( 2025-05-07T20:32:44.3541886Z self, 2025-05-07T20:32:44.3541960Z T: int, 2025-05-07T20:32:44.3542037Z D: int, 2025-05-07T20:32:44.3542133Z scale_ub: Optional[float], 2025-05-07T20:32:44.3542225Z contiguous: bool, 2025-05-07T20:32:44.3542308Z compiled: bool, 2025-05-07T20:32:44.3542384Z ) -> None: 2025-05-07T20:32:44.3542482Z torch.manual_seed(2025) 2025-05-07T20:32:44.3542553Z 2025-05-07T20:32:44.3542718Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3544496Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3544507Z 2025-05-07T20:32:44.3544621Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3544625Z 2025-05-07T20:32:44.3544727Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3544945Z self=, 2025-05-07T20:32:44.3545020Z T=4096, 2025-05-07T20:32:44.3545102Z D=7168, 2025-05-07T20:32:44.3545185Z scale_ub=None, 2025-05-07T20:32:44.3545273Z contiguous=True, 2025-05-07T20:32:44.3545356Z compiled=True, 2025-05-07T20:32:44.3545426Z ) 2025-05-07T20:32:44.3545644Z self = 2025-05-07T20:32:44.3545814Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.3545818Z 2025-05-07T20:32:44.3545894Z @given( 2025-05-07T20:32:44.3546015Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3546115Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3546227Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3546345Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3546454Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3546531Z ) 2025-05-07T20:32:44.3546768Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3546859Z def test_silu_mul_quant( 2025-05-07T20:32:44.3546943Z self, 2025-05-07T20:32:44.3547020Z T: int, 2025-05-07T20:32:44.3547096Z D: int, 2025-05-07T20:32:44.3547196Z scale_ub: Optional[float], 2025-05-07T20:32:44.3547288Z contiguous: bool, 2025-05-07T20:32:44.3547370Z compiled: bool, 2025-05-07T20:32:44.3547447Z ) -> None: 2025-05-07T20:32:44.3547547Z torch.manual_seed(2025) 2025-05-07T20:32:44.3547619Z 2025-05-07T20:32:44.3547781Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3549676Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3549682Z 2025-05-07T20:32:44.3549796Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3549872Z 2025-05-07T20:32:44.3549981Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3550199Z self=, 2025-05-07T20:32:44.3550276Z T=2048, 2025-05-07T20:32:44.3550350Z D=5120, 2025-05-07T20:32:44.3550430Z scale_ub=1200.0, 2025-05-07T20:32:44.3550517Z contiguous=False, 2025-05-07T20:32:44.3550599Z compiled=False, 2025-05-07T20:32:44.3550670Z ) 2025-05-07T20:32:44.3550885Z self = 2025-05-07T20:32:44.3551058Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.3551062Z 2025-05-07T20:32:44.3551137Z @given( 2025-05-07T20:32:44.3551265Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3551361Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3551478Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3551598Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3551709Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3551783Z ) 2025-05-07T20:32:44.3552024Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3552115Z def test_silu_mul_quant( 2025-05-07T20:32:44.3552196Z self, 2025-05-07T20:32:44.3552269Z T: int, 2025-05-07T20:32:44.3552345Z D: int, 2025-05-07T20:32:44.3552442Z scale_ub: Optional[float], 2025-05-07T20:32:44.3552529Z contiguous: bool, 2025-05-07T20:32:44.3552611Z compiled: bool, 2025-05-07T20:32:44.3552692Z ) -> None: 2025-05-07T20:32:44.3552785Z torch.manual_seed(2025) 2025-05-07T20:32:44.3552859Z 2025-05-07T20:32:44.3553026Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3554799Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3554813Z 2025-05-07T20:32:44.3554927Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3554931Z 2025-05-07T20:32:44.3555030Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3555253Z self=, 2025-05-07T20:32:44.3555329Z T=4096, 2025-05-07T20:32:44.3555408Z D=7168, 2025-05-07T20:32:44.3555494Z scale_ub=1200.0, 2025-05-07T20:32:44.3555575Z contiguous=True, 2025-05-07T20:32:44.3555657Z compiled=False, 2025-05-07T20:32:44.3555779Z ) 2025-05-07T20:32:44.3555994Z self = 2025-05-07T20:32:44.3556168Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3556172Z 2025-05-07T20:32:44.3556247Z @given( 2025-05-07T20:32:44.3556361Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3556460Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3556572Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3556683Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3556798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3556872Z ) 2025-05-07T20:32:44.3557199Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3557291Z def test_silu_mul_quant( 2025-05-07T20:32:44.3557367Z self, 2025-05-07T20:32:44.3557445Z T: int, 2025-05-07T20:32:44.3557520Z D: int, 2025-05-07T20:32:44.3557687Z scale_ub: Optional[float], 2025-05-07T20:32:44.3557777Z contiguous: bool, 2025-05-07T20:32:44.3557860Z compiled: bool, 2025-05-07T20:32:44.3557935Z ) -> None: 2025-05-07T20:32:44.3558032Z torch.manual_seed(2025) 2025-05-07T20:32:44.3558106Z 2025-05-07T20:32:44.3558268Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3560057Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3560067Z 2025-05-07T20:32:44.3560180Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3560185Z 2025-05-07T20:32:44.3560289Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3560507Z self=, 2025-05-07T20:32:44.3560586Z T=16384, 2025-05-07T20:32:44.3560660Z D=7168, 2025-05-07T20:32:44.3560739Z scale_ub=None, 2025-05-07T20:32:44.3560826Z contiguous=False, 2025-05-07T20:32:44.3560908Z compiled=True, 2025-05-07T20:32:44.3560980Z ) 2025-05-07T20:32:44.3561199Z self = 2025-05-07T20:32:44.3561371Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.3561375Z 2025-05-07T20:32:44.3561458Z @given( 2025-05-07T20:32:44.3561577Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3561674Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3561796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3561908Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3562017Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3562094Z ) 2025-05-07T20:32:44.3562334Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3562424Z def test_silu_mul_quant( 2025-05-07T20:32:44.3562503Z self, 2025-05-07T20:32:44.3562579Z T: int, 2025-05-07T20:32:44.3562652Z D: int, 2025-05-07T20:32:44.3562751Z scale_ub: Optional[float], 2025-05-07T20:32:44.3562837Z contiguous: bool, 2025-05-07T20:32:44.3562921Z compiled: bool, 2025-05-07T20:32:44.3563001Z ) -> None: 2025-05-07T20:32:44.3563102Z torch.manual_seed(2025) 2025-05-07T20:32:44.3563176Z 2025-05-07T20:32:44.3563337Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3565113Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
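[editor's note] The allocator hint in the message (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) only helps if it reaches the process before the CUDA caching allocator initializes, so it is normally exported in the environment of the pytest command rather than set inside a test. A minimal sketch of the required ordering, assuming a fresh process:

    import os

    # Must be in place before torch performs its first CUDA allocation;
    # the caching allocator reads this variable once, at initialization.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch

    x = torch.randn([2048, 2 * 5120], device="cuda", dtype=torch.bfloat16)
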
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3565126Z 2025-05-07T20:32:44.3565239Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3565243Z 2025-05-07T20:32:44.3565592Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3565954Z self=, 2025-05-07T20:32:44.3566033Z T=4096, 2025-05-07T20:32:44.3566108Z D=7168, 2025-05-07T20:32:44.3566192Z scale_ub=None, 2025-05-07T20:32:44.3566274Z contiguous=True, 2025-05-07T20:32:44.3566467Z compiled=False, 2025-05-07T20:32:44.3566543Z ) 2025-05-07T20:32:44.3566757Z self = 2025-05-07T20:32:44.3566927Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3566932Z 2025-05-07T20:32:44.3567008Z @given( 2025-05-07T20:32:44.3567123Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3567225Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3567337Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3567449Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3567564Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3567636Z ) 2025-05-07T20:32:44.3567888Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3567979Z def test_silu_mul_quant( 2025-05-07T20:32:44.3568053Z self, 2025-05-07T20:32:44.3568136Z T: int, 2025-05-07T20:32:44.3568210Z D: int, 2025-05-07T20:32:44.3568306Z scale_ub: Optional[float], 2025-05-07T20:32:44.3568395Z contiguous: bool, 2025-05-07T20:32:44.3568478Z compiled: bool, 2025-05-07T20:32:44.3568555Z ) -> None: 2025-05-07T20:32:44.3568651Z torch.manual_seed(2025) 2025-05-07T20:32:44.3568723Z 2025-05-07T20:32:44.3568886Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3570671Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3570681Z 2025-05-07T20:32:44.3570794Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3570801Z 2025-05-07T20:32:44.3570900Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3571119Z self=, 2025-05-07T20:32:44.3571200Z T=16384, 2025-05-07T20:32:44.3571274Z D=7168, 2025-05-07T20:32:44.3571353Z scale_ub=None, 2025-05-07T20:32:44.3571440Z contiguous=True, 2025-05-07T20:32:44.3571522Z compiled=False, 2025-05-07T20:32:44.3571595Z ) 2025-05-07T20:32:44.3571810Z self = 2025-05-07T20:32:44.3571985Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3571990Z 2025-05-07T20:32:44.3572069Z @given( 2025-05-07T20:32:44.3572187Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3572286Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3572402Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3572517Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3572627Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3572706Z ) 2025-05-07T20:32:44.3572945Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3573037Z def test_silu_mul_quant( 2025-05-07T20:32:44.3573117Z self, 2025-05-07T20:32:44.3573192Z T: int, 2025-05-07T20:32:44.3573266Z D: int, 2025-05-07T20:32:44.3573366Z scale_ub: Optional[float], 2025-05-07T20:32:44.3573452Z contiguous: bool, 2025-05-07T20:32:44.3573540Z compiled: bool, 2025-05-07T20:32:44.3573697Z ) -> None: 2025-05-07T20:32:44.3573792Z torch.manual_seed(2025) 2025-05-07T20:32:44.3573867Z 2025-05-07T20:32:44.3574030Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3575903Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
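[editor's note] The breakdown these messages report ("21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated") maps onto counters the caching allocator exposes, so the same figures are available outside the error path. A small diagnostic sketch:

    import torch

    # memory_allocated() is the "allocated by PyTorch" figure; memory_reserved()
    # also counts cached segments that are held but not currently handed out.
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"allocated: {allocated:.2f} GiB")
    print(f"reserved but unallocated: {reserved - allocated:.2f} GiB")
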
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3575912Z 2025-05-07T20:32:44.3576026Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3576030Z 2025-05-07T20:32:44.3576137Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3576362Z self=, 2025-05-07T20:32:44.3576437Z T=16384, 2025-05-07T20:32:44.3576516Z D=7168, 2025-05-07T20:32:44.3576605Z scale_ub=1200.0, 2025-05-07T20:32:44.3576687Z contiguous=True, 2025-05-07T20:32:44.3576771Z compiled=False, 2025-05-07T20:32:44.3576846Z ) 2025-05-07T20:32:44.3577058Z self = 2025-05-07T20:32:44.3577233Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3577238Z 2025-05-07T20:32:44.3577314Z @given( 2025-05-07T20:32:44.3577429Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3577528Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3577639Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3577753Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3577870Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3577942Z ) 2025-05-07T20:32:44.3578184Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3578280Z def test_silu_mul_quant( 2025-05-07T20:32:44.3578355Z self, 2025-05-07T20:32:44.3578434Z T: int, 2025-05-07T20:32:44.3578509Z D: int, 2025-05-07T20:32:44.3578605Z scale_ub: Optional[float], 2025-05-07T20:32:44.3578696Z contiguous: bool, 2025-05-07T20:32:44.3578780Z compiled: bool, 2025-05-07T20:32:44.3578857Z ) -> None: 2025-05-07T20:32:44.3578954Z torch.manual_seed(2025) 2025-05-07T20:32:44.3579024Z 2025-05-07T20:32:44.3579200Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3581026Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
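[editor's note] On the contiguous flag these examples toggle (the full test body appears in the next example below): slicing x into halves along dim 1 produces views that share storage and are non-contiguous, and .contiguous() materializes packed copies — the layout difference the kernel is being exercised against. A CPU-only illustration:

    import torch

    # Column slices of a row-major 2-D tensor skip elements in memory, so
    # they are not contiguous until explicitly copied.
    x = torch.randn(4, 2 * 8)
    x0, x1 = x[:, :8], x[:, 8:]
    print(x0.is_contiguous(), x1.is_contiguous())  # False False
    print(x1.contiguous().is_contiguous())         # True (fresh packed copy)
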
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3581036Z 2025-05-07T20:32:44.3581150Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3581156Z 2025-05-07T20:32:44.3581256Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3581476Z self=, 2025-05-07T20:32:44.3581556Z T=128, 2025-05-07T20:32:44.3581630Z D=5120, 2025-05-07T20:32:44.3581711Z scale_ub=1200.0, 2025-05-07T20:32:44.3581797Z contiguous=False, 2025-05-07T20:32:44.3581879Z compiled=False, 2025-05-07T20:32:44.3581949Z ) 2025-05-07T20:32:44.3582247Z self = 2025-05-07T20:32:44.3582417Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.3582422Z 2025-05-07T20:32:44.3582496Z @given( 2025-05-07T20:32:44.3582694Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3582790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3582905Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3583018Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3583128Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3583205Z ) 2025-05-07T20:32:44.3583441Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3583532Z def test_silu_mul_quant( 2025-05-07T20:32:44.3583610Z self, 2025-05-07T20:32:44.3583685Z T: int, 2025-05-07T20:32:44.3583759Z D: int, 2025-05-07T20:32:44.3583858Z scale_ub: Optional[float], 2025-05-07T20:32:44.3583949Z contiguous: bool, 2025-05-07T20:32:44.3584035Z compiled: bool, 2025-05-07T20:32:44.3584111Z ) -> None: 2025-05-07T20:32:44.3584203Z torch.manual_seed(2025) 2025-05-07T20:32:44.3584278Z 2025-05-07T20:32:44.3584446Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3584521Z 2025-05-07T20:32:44.3584614Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3584737Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3584823Z x = x_sign * x_clamp 2025-05-07T20:32:44.3584907Z x0 = x[:, :D] 2025-05-07T20:32:44.3584985Z x1 = x[:, D:] 2025-05-07T20:32:44.3585057Z 2025-05-07T20:32:44.3585145Z if contiguous: 2025-05-07T20:32:44.3585236Z x0 = x0.contiguous() 2025-05-07T20:32:44.3585323Z x1 = x1.contiguous() 2025-05-07T20:32:44.3585396Z 2025-05-07T20:32:44.3585484Z if scale_ub is not None: 2025-05-07T20:32:44.3585596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3585729Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3585803Z ) 2025-05-07T20:32:44.3585882Z else: 2025-05-07T20:32:44.3585981Z scale_ub_tensor = None 2025-05-07T20:32:44.3586053Z 2025-05-07T20:32:44.3586184Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3586274Z op = silu_mul_quant 2025-05-07T20:32:44.3586357Z if compiled: 2025-05-07T20:32:44.3586459Z op = torch.compile(op) 2025-05-07T20:32:44.3586563Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3586634Z 2025-05-07T20:32:44.3586727Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3586731Z 2025-05-07T20:32:44.3586825Z moe/activation_test.py:117: 2025-05-07T20:32:44.3586955Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3587052Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3587153Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3587655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3587754Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3588109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3588330Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3588665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3588761Z kernel = self.compile( 2025-05-07T20:32:44.3589138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3589310Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3589529Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3589534Z 2025-05-07T20:32:44.3589736Z self = 2025-05-07T20:32:44.3590518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3591091Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b153060>} 2025-05-07T20:32:44.3591835Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3592025Z context = 2025-05-07T20:32:44.3592029Z 2025-05-07T20:32:44.3592195Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3592458Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3592569Z module_map=module_map) 2025-05-07T20:32:44.3592730Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3592829Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3592905Z E ^ 2025-05-07T20:32:44.3593264Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3593269Z 2025-05-07T20:32:44.3593676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3593680Z 2025-05-07T20:32:44.3593780Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3594007Z self=, 2025-05-07T20:32:44.3594087Z T=2048, 2025-05-07T20:32:44.3594164Z D=7168, 2025-05-07T20:32:44.3594246Z scale_ub=None, 2025-05-07T20:32:44.3594330Z contiguous=False, 2025-05-07T20:32:44.3594422Z compiled=False, 2025-05-07T20:32:44.3594493Z ) 2025-05-07T20:32:44.3594708Z self = 2025-05-07T20:32:44.3594884Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.3594889Z 2025-05-07T20:32:44.3594965Z @given( 2025-05-07T20:32:44.3595081Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3595182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3595296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3595414Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3595526Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3595599Z ) 2025-05-07T20:32:44.3595900Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3595992Z def test_silu_mul_quant( 2025-05-07T20:32:44.3596066Z self, 2025-05-07T20:32:44.3596145Z T: int, 2025-05-07T20:32:44.3596223Z D: int, 2025-05-07T20:32:44.3596320Z scale_ub: Optional[float], 2025-05-07T20:32:44.3596410Z contiguous: bool, 2025-05-07T20:32:44.3596494Z compiled: bool, 2025-05-07T20:32:44.3596571Z ) -> None: 2025-05-07T20:32:44.3596666Z torch.manual_seed(2025) 2025-05-07T20:32:44.3596740Z 2025-05-07T20:32:44.3596903Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3598775Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
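[editor's note] This CompilationError, unlike the OOM failures, is hardware-dependent: fp8e4nv is Triton's e4m3 float8 type, and this job runs on a g5 (NVIDIA A10G) instance whose architecture Triton rejects here. A hedged guard sketch; the (8, 9) threshold is an inference from the error, not something stated in this log:

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (e4m3) lowering needs compute capability
        # 8.9 or newer; the A10G on this runner reports (8, 6), consistent with
        # the "not supported in this architecture" failure above.
        return torch.cuda.get_device_capability() >= (8, 9)

Tests exercising the e4m3 path could then be skipped on older parts, e.g. via pytest.mark.skipif(not supports_fp8e4nv(), reason="fp8e4nv unsupported").
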
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3598852Z 2025-05-07T20:32:44.3598970Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3598979Z 2025-05-07T20:32:44.3599078Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3599298Z self=, 2025-05-07T20:32:44.3599377Z T=128, 2025-05-07T20:32:44.3599453Z D=7168, 2025-05-07T20:32:44.3599534Z scale_ub=1200.0, 2025-05-07T20:32:44.3599622Z contiguous=True, 2025-05-07T20:32:44.3599702Z compiled=True, 2025-05-07T20:32:44.3599773Z ) 2025-05-07T20:32:44.3599988Z self = 2025-05-07T20:32:44.3600152Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.3600162Z 2025-05-07T20:32:44.3600240Z @given( 2025-05-07T20:32:44.3600358Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3600454Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3600577Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3600690Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3600799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3600876Z ) 2025-05-07T20:32:44.3601114Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3601206Z def test_silu_mul_quant( 2025-05-07T20:32:44.3601287Z self, 2025-05-07T20:32:44.3601361Z T: int, 2025-05-07T20:32:44.3601435Z D: int, 2025-05-07T20:32:44.3601533Z scale_ub: Optional[float], 2025-05-07T20:32:44.3601619Z contiguous: bool, 2025-05-07T20:32:44.3601704Z compiled: bool, 2025-05-07T20:32:44.3601782Z ) -> None: 2025-05-07T20:32:44.3601877Z torch.manual_seed(2025) 2025-05-07T20:32:44.3601954Z 2025-05-07T20:32:44.3602118Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3602197Z 2025-05-07T20:32:44.3602289Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3602411Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3602497Z x = x_sign * x_clamp 2025-05-07T20:32:44.3602580Z x0 = x[:, :D] 2025-05-07T20:32:44.3602658Z x1 = x[:, D:] 2025-05-07T20:32:44.3602729Z 2025-05-07T20:32:44.3602812Z if contiguous: 2025-05-07T20:32:44.3602902Z x0 = x0.contiguous() 2025-05-07T20:32:44.3602991Z x1 = x1.contiguous() 2025-05-07T20:32:44.3603065Z 2025-05-07T20:32:44.3603155Z if scale_ub is not None: 2025-05-07T20:32:44.3603258Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3603392Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3603470Z ) 2025-05-07T20:32:44.3603548Z else: 2025-05-07T20:32:44.3603641Z scale_ub_tensor = None 2025-05-07T20:32:44.3603712Z 2025-05-07T20:32:44.3603843Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3603937Z op = silu_mul_quant 2025-05-07T20:32:44.3604020Z if compiled: 2025-05-07T20:32:44.3604120Z op = torch.compile(op) 2025-05-07T20:32:44.3604223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3604294Z 2025-05-07T20:32:44.3604388Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3604392Z 2025-05-07T20:32:44.3604486Z moe/activation_test.py:117: 2025-05-07T20:32:44.3604617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3604714Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3604812Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3605344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3605437Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3605926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3606125Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3606479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3606703Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3607036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3607127Z kernel = self.compile( 2025-05-07T20:32:44.3607506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3607675Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3607805Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3607813Z 2025-05-07T20:32:44.3608016Z self = 2025-05-07T20:32:44.3608795Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3609301Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8afe0900>} 2025-05-07T20:32:44.3610041Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3610238Z context = 2025-05-07T20:32:44.3610242Z 2025-05-07T20:32:44.3610403Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3610664Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3610776Z module_map=module_map) 2025-05-07T20:32:44.3610935Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3611032Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3611111Z E ^ 2025-05-07T20:32:44.3611462Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3611466Z 2025-05-07T20:32:44.3611874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3611879Z 2025-05-07T20:32:44.3611979Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3612204Z self=, 2025-05-07T20:32:44.3612283Z T=128, 2025-05-07T20:32:44.3612357Z D=7168, 2025-05-07T20:32:44.3612441Z scale_ub=1200.0, 2025-05-07T20:32:44.3612527Z contiguous=True, 2025-05-07T20:32:44.3612608Z compiled=False, 2025-05-07T20:32:44.3612682Z ) 2025-05-07T20:32:44.3612898Z self = 2025-05-07T20:32:44.3613064Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3613068Z 2025-05-07T20:32:44.3613148Z @given( 2025-05-07T20:32:44.3613264Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3613362Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3613481Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3613594Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3613708Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3613868Z ) 2025-05-07T20:32:44.3614110Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3614203Z def test_silu_mul_quant( 2025-05-07T20:32:44.3614350Z self, 2025-05-07T20:32:44.3614425Z T: int, 2025-05-07T20:32:44.3614503Z D: int, 2025-05-07T20:32:44.3614600Z scale_ub: Optional[float], 2025-05-07T20:32:44.3614687Z contiguous: bool, 2025-05-07T20:32:44.3614773Z compiled: bool, 2025-05-07T20:32:44.3614850Z ) -> None: 2025-05-07T20:32:44.3614944Z torch.manual_seed(2025) 2025-05-07T20:32:44.3615021Z 2025-05-07T20:32:44.3615185Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3615259Z 2025-05-07T20:32:44.3615350Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3615471Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3617266Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
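[editor's note] By this point the process holds 22.05 GiB with only 8.44 MiB free, so even a 20 MiB intermediate (x_clamp here, and x_sign in the next example) fails: allocations from earlier examples are evidently outliving their example. One possible mitigation between Hypothesis examples (an assumption about a fix, not something the test file does):

    import gc

    import torch

    def reset_cuda_between_examples() -> None:
        gc.collect()              # drop dead Python references to tensors
        torch.cuda.empty_cache()  # return cached, unused blocks to the driver

Note that empty_cache() only releases cached-but-unallocated memory; tensors that are still referenced, e.g. by a failing example being replayed, stay allocated.
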
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3617277Z 2025-05-07T20:32:44.3617394Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.3617398Z 2025-05-07T20:32:44.3617501Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3617720Z self=, 2025-05-07T20:32:44.3617794Z T=128, 2025-05-07T20:32:44.3617871Z D=5120, 2025-05-07T20:32:44.3617950Z scale_ub=1200.0, 2025-05-07T20:32:44.3618030Z contiguous=True, 2025-05-07T20:32:44.3618114Z compiled=True, 2025-05-07T20:32:44.3618190Z ) 2025-05-07T20:32:44.3618404Z self = 2025-05-07T20:32:44.3618572Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.3618582Z 2025-05-07T20:32:44.3618657Z @given( 2025-05-07T20:32:44.3618776Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3618872Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3618984Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3619104Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3619236Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3619318Z ) 2025-05-07T20:32:44.3619576Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3619667Z def test_silu_mul_quant( 2025-05-07T20:32:44.3619744Z self, 2025-05-07T20:32:44.3619819Z T: int, 2025-05-07T20:32:44.3619897Z D: int, 2025-05-07T20:32:44.3619996Z scale_ub: Optional[float], 2025-05-07T20:32:44.3620082Z contiguous: bool, 2025-05-07T20:32:44.3620165Z compiled: bool, 2025-05-07T20:32:44.3620248Z ) -> None: 2025-05-07T20:32:44.3620339Z torch.manual_seed(2025) 2025-05-07T20:32:44.3620410Z 2025-05-07T20:32:44.3620574Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3620650Z 2025-05-07T20:32:44.3620740Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.3622602Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
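[editor's note] For reference, the math under test is small: per ref_fn (which appears later in this log), silu_mul_quant fuses SiLU(x0) * x1 with row-wise fp8 quantization, and the eager float32 product is:

    import torch

    # Matches ref_fn below: y = x0 * sigmoid(x0) * x1 in float32, i.e.
    # SiLU(x0) * x1, computed before triton_quantize_fp8_row is applied.
    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        x0f, x1f = x0.float(), x1.float()
        return x0f * torch.sigmoid(x0f) * x1f
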
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3622609Z 2025-05-07T20:32:44.3622726Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.3622805Z 2025-05-07T20:32:44.3622909Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3623128Z self=, 2025-05-07T20:32:44.3623206Z T=128, 2025-05-07T20:32:44.3623279Z D=7168, 2025-05-07T20:32:44.3623359Z scale_ub=None, 2025-05-07T20:32:44.3623443Z contiguous=True, 2025-05-07T20:32:44.3623523Z compiled=True, 2025-05-07T20:32:44.3623594Z ) 2025-05-07T20:32:44.3623810Z self = 2025-05-07T20:32:44.3623970Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.3623975Z 2025-05-07T20:32:44.3624050Z @given( 2025-05-07T20:32:44.3624174Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3624271Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3624388Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3624501Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3624616Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3624692Z ) 2025-05-07T20:32:44.3624932Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3625021Z def test_silu_mul_quant( 2025-05-07T20:32:44.3625100Z self, 2025-05-07T20:32:44.3625176Z T: int, 2025-05-07T20:32:44.3625251Z D: int, 2025-05-07T20:32:44.3625349Z scale_ub: Optional[float], 2025-05-07T20:32:44.3625437Z contiguous: bool, 2025-05-07T20:32:44.3625519Z compiled: bool, 2025-05-07T20:32:44.3625599Z ) -> None: 2025-05-07T20:32:44.3625690Z torch.manual_seed(2025) 2025-05-07T20:32:44.3625765Z 2025-05-07T20:32:44.3625938Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3627711Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3627722Z 2025-05-07T20:32:44.3627845Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3627979Z =============================== warnings summary =============================== 2025-05-07T20:32:44.3628287Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:44.3628589Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:44.3632990Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:44.3633898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:44.3634125Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:44.3634133Z 2025-05-07T20:32:44.3634310Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:32:44.3635688Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:32:44.3635992Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:32:44.3636078Z 2025-05-07T20:32:44.3636289Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:44.3636458Z ================== 1 failed, 1 passed, 13 warnings in 20.18s =================== 2025-05-07T20:32:46.0908165Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:46.1522566Z 2025-05-07T20:32:46.1523144Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:32:46.1523624Z 2025-05-07T20:32:46.1523630Z 2025-05-07T20:32:46.1543643Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:48.3093925Z ============================= test session starts ============================== 2025-05-07T20:32:48.3095625Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:48.3096722Z cachedir: .pytest_cache 2025-05-07T20:32:48.3097845Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:48.3099270Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:48.3100078Z plugins: hypothesis-6.131.14 2025-05-07T20:32:49.9301084Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:50.0388516Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:50.0389559Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:50.0390108Z 2025-05-07T20:32:52.1320858Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:52.1321981Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:52.1323353Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:52.1324830Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:52.1325833Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.1327136Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:52.1328527Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.1329520Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.1330752Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:52.1332469Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.1333692Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.1334983Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:52.1336235Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:52.1337467Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:52.1338683Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:52.1339517Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.1340551Z W0507 
20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:52.1341574Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:52.1342380Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:32:52.1343593Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:52.1344891Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:52.1346017Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:52.1347063Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:52.1348247Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:52.1349607Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:52.1350679Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.1351603Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.1352354Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:52.1353371Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.1482905Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:52.1483984Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:52.1485477Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:52.1486905Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:52.1487892Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.1489357Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:52.1490748Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.1491734Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.1492974Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:52.1494358Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.1495417Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.1496711Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:52.1498111Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:52.1499411Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:52.1500625Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:52.1501444Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.1502476Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:52.1503498Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:52.1504299Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:32:52.1505616Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:52.1506893Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:52.1508087Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:52.1509134Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:52.1510314Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:52.1511678Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:52.1512735Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.1513657Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.1514400Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:52.1515422Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.5693984Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.5694643Z self=, 2025-05-07T20:32:52.5695071Z T=1, 2025-05-07T20:32:52.5695269Z D=5120, 2025-05-07T20:32:52.5695462Z scale_ub=None, 2025-05-07T20:32:52.5695687Z contiguous=True, 2025-05-07T20:32:52.5695916Z compiled=True, 2025-05-07T20:32:52.5696137Z ) 2025-05-07T20:32:52.5696466Z self = 2025-05-07T20:32:52.5696960Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:52.5697221Z 2025-05-07T20:32:52.5697303Z @given( 2025-05-07T20:32:52.5697539Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.5697869Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.5698177Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.5698519Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.5698856Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.5699153Z ) 2025-05-07T20:32:52.5699509Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.5699972Z def test_silu_mul_quant( 2025-05-07T20:32:52.5700228Z self, 2025-05-07T20:32:52.5700432Z T: int, 2025-05-07T20:32:52.5700642Z D: int, 2025-05-07T20:32:52.5700874Z scale_ub: Optional[float], 2025-05-07T20:32:52.5701151Z contiguous: bool, 2025-05-07T20:32:52.5701402Z compiled: bool, 2025-05-07T20:32:52.5701646Z ) -> None: 2025-05-07T20:32:52.5701861Z torch.manual_seed(2025) 2025-05-07T20:32:52.5702113Z 2025-05-07T20:32:52.5702398Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.5702742Z 2025-05-07T20:32:52.5702946Z x_sign = torch.sign(x) 2025-05-07T20:32:52.5703250Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.5703562Z x = x_sign * x_clamp 2025-05-07T20:32:52.5703817Z x0 = x[:, :D] 2025-05-07T20:32:52.5704041Z x1 = x[:, D:] 2025-05-07T20:32:52.5704248Z 2025-05-07T20:32:52.5704708Z if contiguous: 2025-05-07T20:32:52.5704954Z x0 = x0.contiguous() 2025-05-07T20:32:52.5705229Z x1 = x1.contiguous() 2025-05-07T20:32:52.5705472Z 2025-05-07T20:32:52.5705821Z if scale_ub is not None: 2025-05-07T20:32:52.5706104Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.5706441Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.5706758Z ) 2025-05-07T20:32:52.5706956Z else: 2025-05-07T20:32:52.5707168Z scale_ub_tensor = None 2025-05-07T20:32:52.5707426Z 2025-05-07T20:32:52.5707663Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.5707977Z op = silu_mul_quant 2025-05-07T20:32:52.5708238Z if compiled: 2025-05-07T20:32:52.5708492Z op = torch.compile(op) 2025-05-07T20:32:52.5708787Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.5709067Z 2025-05-07T20:32:52.5709275Z y_fp8, y_scale = fn() 2025-05-07T20:32:52.5709561Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:52.5709867Z 2025-05-07T20:32:52.5710116Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.5710464Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:52.5710760Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:52.5711080Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:52.5711448Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.5711760Z 2025-05-07T20:32:52.5711973Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:52.5712170Z 2025-05-07T20:32:52.5712282Z moe/activation_test.py:126: 2025-05-07T20:32:52.5712582Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.5712928Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:52.5713262Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.5714063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:52.5714815Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:52.5715373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.5716135Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.5716829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:52.5717552Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.5718290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:52.5718935Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:52.5719539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:52.5720061Z fn() 2025-05-07T20:32:52.5720575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:52.5721164Z self.fn.run( 2025-05-07T20:32:52.5721628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.5722163Z kernel = self.compile( 2025-05-07T20:32:52.5722704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.5723352Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.5723787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.5724047Z 2025-05-07T20:32:52.5724254Z self = 2025-05-07T20:32:52.5725432Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.5726908Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b91f50c20>} 2025-05-07T20:32:52.5728251Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.5729279Z context = 2025-05-07T20:32:52.5729568Z 2025-05-07T20:32:52.5729742Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.5730275Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.5730744Z module_map=module_map) 2025-05-07T20:32:52.5731113Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.5731485Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:52.5731750Z E ^ 2025-05-07T20:32:52.5732218Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.5732672Z 2025-05-07T20:32:52.5733094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.5733608Z 2025-05-07T20:32:52.5733742Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.5734177Z self=, 2025-05-07T20:32:52.5734597Z T=2048, 2025-05-07T20:32:52.5734790Z D=5120, 2025-05-07T20:32:52.5734982Z scale_ub=1200.0, 2025-05-07T20:32:52.5735209Z contiguous=True, 2025-05-07T20:32:52.5735440Z compiled=False, 2025-05-07T20:32:52.5735645Z ) 2025-05-07T20:32:53.0200064Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:53.0201175Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:32:53.0202523Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:53.0204006Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:53.0205008Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.0206306Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:53.0207692Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.0208679Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.0210241Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:53.0211622Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.0212832Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.0214116Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:53.0215364Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:32:53.0216592Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:53.0217805Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:32:53.0218635Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.0219662Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:53.0220678Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:32:53.0221474Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:32:53.0222682Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:53.0224012Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:53.0225134Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:53.0226171Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:32:53.0227351Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:53.0228698Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:53.0229759Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.0230671Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.0231411Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:32:53.0232422Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.5796537Z self = 2025-05-07T20:32:53.5797109Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:53.5797496Z 2025-05-07T20:32:53.5797593Z @given( 2025-05-07T20:32:53.5797827Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.5798154Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.5798478Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.5798812Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.5799150Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.5799445Z ) 2025-05-07T20:32:53.5799794Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.5800241Z def test_silu_mul_quant( 2025-05-07T20:32:53.5800491Z self, 2025-05-07T20:32:53.5800691Z T: int, 2025-05-07T20:32:53.5800895Z D: int, 2025-05-07T20:32:53.5801123Z scale_ub: Optional[float], 2025-05-07T20:32:53.5801399Z contiguous: bool, 2025-05-07T20:32:53.5801665Z compiled: bool, 2025-05-07T20:32:53.5801897Z ) -> None: 2025-05-07T20:32:53.5802121Z torch.manual_seed(2025) 2025-05-07T20:32:53.5802374Z 2025-05-07T20:32:53.5802665Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.5803019Z 2025-05-07T20:32:53.5803230Z x_sign = torch.sign(x) 2025-05-07T20:32:53.5803529Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.5803846Z x = x_sign * x_clamp 2025-05-07T20:32:53.5804095Z x0 = x[:, :D] 2025-05-07T20:32:53.5804327Z x1 = x[:, D:] 2025-05-07T20:32:53.5804533Z 2025-05-07T20:32:53.5804770Z if contiguous: 2025-05-07T20:32:53.5805004Z x0 = x0.contiguous() 2025-05-07T20:32:53.5805279Z x1 = x1.contiguous() 2025-05-07T20:32:53.5805532Z 2025-05-07T20:32:53.5805723Z if scale_ub is not None: 2025-05-07T20:32:53.5806007Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.5806350Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.5806895Z ) 2025-05-07T20:32:53.5807098Z else: 2025-05-07T20:32:53.5807313Z scale_ub_tensor = None 2025-05-07T20:32:53.5807568Z 2025-05-07T20:32:53.5807813Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.5808280Z op = silu_mul_quant 2025-05-07T20:32:53.5808536Z if compiled: 2025-05-07T20:32:53.5808786Z op = torch.compile(op) 2025-05-07T20:32:53.5809088Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.5809376Z 2025-05-07T20:32:53.5809571Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.5809741Z 2025-05-07T20:32:53.5809844Z moe/activation_test.py:117: 2025-05-07T20:32:53.5810148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.5810484Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.5810772Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.5811475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.5812178Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.5812718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.5813412Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.5814084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.5814615Z kernel = self.compile( 2025-05-07T20:32:53.5815163Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.5815821Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.5816229Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.5816461Z 2025-05-07T20:32:53.5816676Z self = 2025-05-07T20:32:53.5817775Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.5819178Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b91e10180>} 2025-05-07T20:32:53.5820529Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.5821567Z context = 2025-05-07T20:32:53.5821856Z 2025-05-07T20:32:53.5822028Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.5822572Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.5823049Z module_map=module_map) 2025-05-07T20:32:53.5823413Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.5823779Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.5824051Z E ^ 2025-05-07T20:32:53.5824518Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.5824971Z 2025-05-07T20:32:53.5825384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.5825904Z 2025-05-07T20:32:53.5826010Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.5826427Z self=, 2025-05-07T20:32:53.5826841Z T=2048, 2025-05-07T20:32:53.5827030Z D=5120, 2025-05-07T20:32:53.5827228Z scale_ub=1200.0, 2025-05-07T20:32:53.5827589Z contiguous=True, 2025-05-07T20:32:53.5827815Z compiled=True, 2025-05-07T20:32:53.5828027Z ) 2025-05-07T20:32:53.5828353Z self = 2025-05-07T20:32:53.5828928Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:53.5829207Z 2025-05-07T20:32:53.5829287Z @given( 2025-05-07T20:32:53.5829522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.5829835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.5830151Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.5830500Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.5830827Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.5831125Z ) 2025-05-07T20:32:53.5840356Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.5840850Z def test_silu_mul_quant( 2025-05-07T20:32:53.5841107Z self, 2025-05-07T20:32:53.5841317Z T: int, 2025-05-07T20:32:53.5841529Z D: int, 2025-05-07T20:32:53.5841754Z scale_ub: Optional[float], 2025-05-07T20:32:53.5842048Z contiguous: bool, 2025-05-07T20:32:53.5842301Z compiled: bool, 2025-05-07T20:32:53.5842528Z ) -> None: 2025-05-07T20:32:53.5842760Z torch.manual_seed(2025) 2025-05-07T20:32:53.5843022Z 2025-05-07T20:32:53.5843301Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.5843658Z 2025-05-07T20:32:53.5843870Z x_sign = torch.sign(x) 2025-05-07T20:32:53.5844196Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.5844537Z x = x_sign * x_clamp 2025-05-07T20:32:53.5844793Z x0 = x[:, :D] 
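# A minimal eager sketch of what this test exercises, mirroring the ref_fn in
# the listing above: SiLU(x0) * x1 computed in fp32, which the kernel under
# test (silu_mul_quant) fuses with row-wise FP8 quantization.
import torch

def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # Upcast to fp32 so the reference does not depend on bf16 rounding.
    x0_fp32 = x0.to(torch.float32)
    x1_fp32 = x1.to(torch.float32)
    return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32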
2025-05-07T20:32:53.5845019Z x1 = x[:, D:] 2025-05-07T20:32:53.5845229Z 2025-05-07T20:32:53.5845426Z if contiguous: 2025-05-07T20:32:53.5845670Z x0 = x0.contiguous() 2025-05-07T20:32:53.5845942Z x1 = x1.contiguous() 2025-05-07T20:32:53.5846195Z 2025-05-07T20:32:53.5846405Z if scale_ub is not None: 2025-05-07T20:32:53.5846686Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.5847041Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.5847367Z ) 2025-05-07T20:32:53.5847574Z else: 2025-05-07T20:32:53.5847791Z scale_ub_tensor = None 2025-05-07T20:32:53.5848067Z 2025-05-07T20:32:53.5848315Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.5848634Z op = silu_mul_quant 2025-05-07T20:32:53.5848900Z if compiled: 2025-05-07T20:32:53.5849165Z op = torch.compile(op) 2025-05-07T20:32:53.5849470Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.5849765Z 2025-05-07T20:32:53.5850042Z y_fp8, y_scale = fn() 2025-05-07T20:32:53.5850428Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:53.5850751Z 2025-05-07T20:32:53.5851003Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.5851344Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:53.5851655Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:53.5851986Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:53.5852370Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.5852693Z 2025-05-07T20:32:53.5852924Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:53.5853127Z 2025-05-07T20:32:53.5853242Z moe/activation_test.py:126: 2025-05-07T20:32:53.5853544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.5853894Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:53.5854236Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.5855160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:53.5855934Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:53.5856489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.5857263Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.5857954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:53.5858687Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:53.5859429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:53.5860079Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:53.5860683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:53.5861393Z fn() 2025-05-07T20:32:53.5862023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:53.5862620Z self.fn.run( 2025-05-07T20:32:53.5863104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.5863656Z kernel = self.compile( 2025-05-07T20:32:53.5864211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.5864874Z 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.5865289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.5865873Z 2025-05-07T20:32:53.5866094Z self = 2025-05-07T20:32:53.5867192Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.5868571Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b906de840>} 2025-05-07T20:32:53.5869930Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.5870968Z context = 2025-05-07T20:32:53.5871260Z 2025-05-07T20:32:53.5871440Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.5871970Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.5872453Z module_map=module_map) 2025-05-07T20:32:53.5872834Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.5873203Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:53.5873471Z E ^ 2025-05-07T20:32:53.5873952Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.5874412Z 2025-05-07T20:32:53.5874841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.5875354Z 2025-05-07T20:32:53.5875469Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.5875947Z self=, 2025-05-07T20:32:53.5876366Z T=16384, 2025-05-07T20:32:53.5876574Z D=7168, 2025-05-07T20:32:53.5876771Z scale_ub=1200.0, 2025-05-07T20:32:53.5877013Z contiguous=False, 2025-05-07T20:32:53.5877256Z compiled=False, 2025-05-07T20:32:53.5877467Z ) 2025-05-07T20:32:53.8345963Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:53.8348380Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:32:53.8351067Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:53.8353930Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:53.8355133Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.8356545Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:53.8357945Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.8358938Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.8360176Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:53.8361559Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.8362636Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.8363929Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:53.8365188Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:32:53.8366768Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:53.8367989Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:32:53.8368833Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.8369986Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:53.8371017Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:32:53.8371817Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^ 2025-05-07T20:32:53.8373199Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:53.8374553Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:53.8375801Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:53.8376857Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:32:53.8378042Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:53.8379419Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:53.8380633Z 
W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.8381571Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.8382324Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:32:53.8383349Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4036282Z self = 2025-05-07T20:32:54.4037185Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.4037662Z 2025-05-07T20:32:54.4037792Z @given( 2025-05-07T20:32:54.4038543Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.4039056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.4039533Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.4040314Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.4040819Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.4041270Z ) 2025-05-07T20:32:54.4041840Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.4042561Z def test_silu_mul_quant( 2025-05-07T20:32:54.4042933Z self, 2025-05-07T20:32:54.4043243Z T: int, 2025-05-07T20:32:54.4043553Z D: int, 2025-05-07T20:32:54.4043891Z scale_ub: Optional[float], 2025-05-07T20:32:54.4044335Z contiguous: bool, 2025-05-07T20:32:54.4044738Z compiled: bool, 2025-05-07T20:32:54.4045111Z ) -> None: 2025-05-07T20:32:54.4045460Z torch.manual_seed(2025) 2025-05-07T20:32:54.4045848Z 2025-05-07T20:32:54.4046290Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.4046856Z 2025-05-07T20:32:54.4047174Z x_sign = torch.sign(x) 2025-05-07T20:32:54.4047656Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.4048152Z x = x_sign * x_clamp 2025-05-07T20:32:54.4048548Z x0 = x[:, :D] 2025-05-07T20:32:54.4048898Z x1 = x[:, D:] 2025-05-07T20:32:54.4049221Z 2025-05-07T20:32:54.4049519Z if contiguous: 2025-05-07T20:32:54.4049888Z x0 = x0.contiguous() 2025-05-07T20:32:54.4050296Z x1 = x1.contiguous() 2025-05-07T20:32:54.4050686Z 2025-05-07T20:32:54.4050992Z if scale_ub is not None: 2025-05-07T20:32:54.4051419Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.4051952Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.4052446Z ) 2025-05-07T20:32:54.4052742Z else: 2025-05-07T20:32:54.4053084Z scale_ub_tensor = None 2025-05-07T20:32:54.4053481Z 2025-05-07T20:32:54.4053839Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.4054351Z op = silu_mul_quant 2025-05-07T20:32:54.4054770Z if compiled: 2025-05-07T20:32:54.4055173Z op = torch.compile(op) 2025-05-07T20:32:54.4055593Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4055986Z 2025-05-07T20:32:54.4056270Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.4056516Z 2025-05-07T20:32:54.4056666Z moe/activation_test.py:117: 2025-05-07T20:32:54.4057111Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4057636Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.4058060Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4059191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.4060387Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.4061319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.4062407Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.4063512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.4064379Z kernel = self.compile( 2025-05-07T20:32:54.4065325Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.4066885Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.4067573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4067960Z 2025-05-07T20:32:54.4068308Z self = 2025-05-07T20:32:54.4070422Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.4072802Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b908cd260>} 2025-05-07T20:32:54.4075044Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.4076755Z context = 2025-05-07T20:32:54.4077217Z 2025-05-07T20:32:54.4077495Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.4078351Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.4079128Z module_map=module_map) 2025-05-07T20:32:54.4079731Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.4080299Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.4080740Z E ^ 2025-05-07T20:32:54.4081496Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4082237Z 2025-05-07T20:32:54.4082972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.4083793Z 2025-05-07T20:32:54.4083951Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.4084574Z self=, 2025-05-07T20:32:54.4085300Z T=1, 2025-05-07T20:32:54.4085601Z D=7168, 2025-05-07T20:32:54.4085921Z scale_ub=None, 2025-05-07T20:32:54.4086287Z contiguous=True, 2025-05-07T20:32:54.4086655Z compiled=True, 2025-05-07T20:32:54.4086998Z ) 2025-05-07T20:32:54.4087536Z self = 2025-05-07T20:32:54.4088370Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.4088826Z 2025-05-07T20:32:54.4088955Z @given( 2025-05-07T20:32:54.4089342Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.4089871Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.4090384Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.4090949Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.4091509Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.4091989Z ) 2025-05-07T20:32:54.4092589Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.4093355Z def test_silu_mul_quant( 2025-05-07T20:32:54.4093759Z self, 2025-05-07T20:32:54.4094080Z T: int, 2025-05-07T20:32:54.4094418Z D: int, 2025-05-07T20:32:54.4094784Z scale_ub: Optional[float], 2025-05-07T20:32:54.4095217Z contiguous: bool, 2025-05-07T20:32:54.4095594Z compiled: bool, 2025-05-07T20:32:54.4095972Z ) -> None: 2025-05-07T20:32:54.4096308Z torch.manual_seed(2025) 2025-05-07T20:32:54.4096703Z 2025-05-07T20:32:54.4097137Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.4097683Z 2025-05-07T20:32:54.4098004Z x_sign = torch.sign(x) 2025-05-07T20:32:54.4098497Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.4099014Z x = x_sign * x_clamp 2025-05-07T20:32:54.4099420Z x0 = x[:, :D] 2025-05-07T20:32:54.4099779Z x1 = 
x[:, D:] 2025-05-07T20:32:54.4100118Z 2025-05-07T20:32:54.4100425Z if contiguous: 2025-05-07T20:32:54.4100813Z x0 = x0.contiguous() 2025-05-07T20:32:54.4101246Z x1 = x1.contiguous() 2025-05-07T20:32:54.4101640Z 2025-05-07T20:32:54.4102104Z if scale_ub is not None: 2025-05-07T20:32:54.4102582Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.4103135Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.4103735Z ) 2025-05-07T20:32:54.4104063Z else: 2025-05-07T20:32:54.4104408Z scale_ub_tensor = None 2025-05-07T20:32:54.4104834Z 2025-05-07T20:32:54.4105220Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.4105749Z op = silu_mul_quant 2025-05-07T20:32:54.4106168Z if compiled: 2025-05-07T20:32:54.4106582Z op = torch.compile(op) 2025-05-07T20:32:54.4107081Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4107556Z 2025-05-07T20:32:54.4107876Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.4108345Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.4108848Z 2025-05-07T20:32:54.4109250Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.4109825Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.4110320Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.4110853Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.4111471Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.4111998Z 2025-05-07T20:32:54.4112332Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:54.4112668Z 2025-05-07T20:32:54.4112841Z moe/activation_test.py:126: 2025-05-07T20:32:54.4113340Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4113918Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.4114478Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.4115901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.4116922Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.4117664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.4118603Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.4119534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.4120558Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.4121618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.4122563Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.4123411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.4124158Z fn() 2025-05-07T20:32:54.4124959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.4125788Z self.fn.run( 2025-05-07T20:32:54.4126474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.4127278Z kernel = self.compile( 2025-05-07T20:32:54.4128111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.4129141Z module = src.make_ir(options, 
codegen_fns, module_map, context) 2025-05-07T20:32:54.4129794Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4130164Z 2025-05-07T20:32:54.4130496Z self = 2025-05-07T20:32:54.4132449Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.4134850Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b67ecaac0>} 2025-05-07T20:32:54.4137194Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.4138826Z context = 2025-05-07T20:32:54.4139222Z 2025-05-07T20:32:54.4139495Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.4140266Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.4140963Z module_map=module_map) 2025-05-07T20:32:54.4141524Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.4142074Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.4142489Z E ^ 2025-05-07T20:32:54.4143212Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4143924Z 2025-05-07T20:32:54.4144576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.4145382Z 2025-05-07T20:32:54.4145543Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.4146181Z self=, 2025-05-07T20:32:54.4146810Z T=4096, 2025-05-07T20:32:54.4147100Z D=5120, 2025-05-07T20:32:54.4147388Z scale_ub=None, 2025-05-07T20:32:54.4147726Z contiguous=False, 2025-05-07T20:32:54.4148071Z compiled=False, 2025-05-07T20:32:54.4148383Z ) 2025-05-07T20:32:54.8690051Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:54.8691922Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:32:54.8694218Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:54.8696590Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:54.8698302Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.8700511Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:54.8702839Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8704528Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.8706660Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:54.8709462Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8711315Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.8713749Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:54.8716065Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 2025-05-07T20:32:54.8718167Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:54.8720283Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:32:54.8721612Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.8723343Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:54.8725097Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:32:54.8726445Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^ 2025-05-07T20:32:54.8728420Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:54.8730515Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:54.8732292Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:54.8734004Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:32:54.8735932Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:54.8738207Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:54.8740028Z 
W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8741530Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8742741Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:32:54.8744510Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.6993913Z self = 2025-05-07T20:32:55.6994520Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:55.6994963Z 2025-05-07T20:32:55.6995057Z @given( 2025-05-07T20:32:55.6995535Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.6996316Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.6996951Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.6997639Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.6998293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.6998897Z ) 2025-05-07T20:32:55.6999607Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.7000509Z def test_silu_mul_quant( 2025-05-07T20:32:55.7000995Z self, 2025-05-07T20:32:55.7001397Z T: int, 2025-05-07T20:32:55.7001803Z D: int, 2025-05-07T20:32:55.7002236Z scale_ub: Optional[float], 2025-05-07T20:32:55.7002781Z contiguous: bool, 2025-05-07T20:32:55.7003263Z compiled: bool, 2025-05-07T20:32:55.7003713Z ) -> None: 2025-05-07T20:32:55.7004152Z torch.manual_seed(2025) 2025-05-07T20:32:55.7004644Z 2025-05-07T20:32:55.7005144Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.7005509Z 2025-05-07T20:32:55.7005712Z x_sign = torch.sign(x) 2025-05-07T20:32:55.7006004Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.7006330Z x = x_sign * x_clamp 2025-05-07T20:32:55.7006591Z x0 = x[:, :D] 2025-05-07T20:32:55.7006812Z x1 = x[:, D:] 2025-05-07T20:32:55.7007026Z 2025-05-07T20:32:55.7007223Z if contiguous: 2025-05-07T20:32:55.7007456Z x0 = x0.contiguous() 2025-05-07T20:32:55.7007728Z x1 = x1.contiguous() 2025-05-07T20:32:55.7007979Z 2025-05-07T20:32:55.7008177Z if scale_ub is not None: 2025-05-07T20:32:55.7008463Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.7008807Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.7009129Z ) 2025-05-07T20:32:55.7009325Z else: 2025-05-07T20:32:55.7009548Z scale_ub_tensor = None 2025-05-07T20:32:55.7009811Z 2025-05-07T20:32:55.7010391Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.7010724Z op = silu_mul_quant 2025-05-07T20:32:55.7010985Z if compiled: 2025-05-07T20:32:55.7011237Z op = torch.compile(op) 2025-05-07T20:32:55.7011728Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.7012018Z 2025-05-07T20:32:55.7012215Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.7012391Z 2025-05-07T20:32:55.7012496Z moe/activation_test.py:117: 2025-05-07T20:32:55.7012801Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.7013146Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.7013430Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.7014132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.7014830Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.7015372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.7016069Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.7016740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.7017288Z kernel = self.compile( 2025-05-07T20:32:55.7017832Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.7018496Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.7018907Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.7019137Z 2025-05-07T20:32:55.7019344Z self = 2025-05-07T20:32:55.7020437Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.7022042Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b908cdee0>} 2025-05-07T20:32:55.7023400Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.7024435Z context = 2025-05-07T20:32:55.7024729Z 2025-05-07T20:32:55.7024897Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.7025430Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.7025910Z module_map=module_map) 2025-05-07T20:32:55.7026285Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.7026642Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.7026914Z E ^ 2025-05-07T20:32:55.7027389Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.7027847Z 2025-05-07T20:32:55.7028263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.7028781Z 2025-05-07T20:32:55.7028888Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.7029311Z self=, 2025-05-07T20:32:55.7029729Z T=4096, 2025-05-07T20:32:55.7029924Z D=7168, 2025-05-07T20:32:55.7030128Z scale_ub=None, 2025-05-07T20:32:55.7030355Z contiguous=False, 2025-05-07T20:32:55.7030583Z compiled=False, 2025-05-07T20:32:55.7030803Z ) 2025-05-07T20:32:55.7031233Z self = 2025-05-07T20:32:55.7031736Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:55.7032024Z 2025-05-07T20:32:55.7032106Z @given( 2025-05-07T20:32:55.7032425Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.7032748Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.7033057Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.7033398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.7033734Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.7034022Z ) 2025-05-07T20:32:55.7034381Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.7034998Z def test_silu_mul_quant( 2025-05-07T20:32:55.7035252Z self, 2025-05-07T20:32:55.7035456Z T: int, 2025-05-07T20:32:55.7035665Z D: int, 2025-05-07T20:32:55.7035948Z scale_ub: Optional[float], 2025-05-07T20:32:55.7036235Z contiguous: bool, 2025-05-07T20:32:55.7036486Z compiled: bool, 2025-05-07T20:32:55.7036714Z ) -> None: 2025-05-07T20:32:55.7036940Z torch.manual_seed(2025) 2025-05-07T20:32:55.7037197Z 2025-05-07T20:32:55.7037470Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.7037823Z 2025-05-07T20:32:55.7038028Z x_sign = torch.sign(x) 2025-05-07T20:32:55.7038330Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.7038647Z x = x_sign * x_clamp 2025-05-07T20:32:55.7038895Z x0 = x[:, :D] 
2025-05-07T20:32:55.7039112Z x1 = x[:, D:] 2025-05-07T20:32:55.7039327Z 2025-05-07T20:32:55.7039522Z if contiguous: 2025-05-07T20:32:55.7039755Z x0 = x0.contiguous() 2025-05-07T20:32:55.7040021Z x1 = x1.contiguous() 2025-05-07T20:32:55.7040268Z 2025-05-07T20:32:55.7040463Z if scale_ub is not None: 2025-05-07T20:32:55.7040749Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.7041087Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.7041399Z ) 2025-05-07T20:32:55.7041595Z else: 2025-05-07T20:32:55.7041819Z scale_ub_tensor = None 2025-05-07T20:32:55.7042076Z 2025-05-07T20:32:55.7042308Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.7042630Z op = silu_mul_quant 2025-05-07T20:32:55.7042887Z if compiled: 2025-05-07T20:32:55.7043133Z op = torch.compile(op) 2025-05-07T20:32:55.7043434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.7043712Z 2025-05-07T20:32:55.7043910Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.7044081Z 2025-05-07T20:32:55.7044185Z moe/activation_test.py:117: 2025-05-07T20:32:55.7044488Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.7044821Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.7045116Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.7045806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.7046500Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.7047045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.7047731Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.7048398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.7048929Z kernel = self.compile( 2025-05-07T20:32:55.7049471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.7050134Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.7050646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.7050881Z 2025-05-07T20:32:55.7051089Z self = 2025-05-07T20:32:55.7052177Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.7053637Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b908cd940>} 2025-05-07T20:32:55.7054991Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.7056027Z context = 2025-05-07T20:32:55.7056320Z 2025-05-07T20:32:55.7056500Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.7057035Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.7057531Z module_map=module_map) 2025-05-07T20:32:55.7057896Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.7058261Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.7058535Z E ^ 2025-05-07T20:32:55.7059009Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.7059461Z 2025-05-07T20:32:55.7059877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.7060395Z 2025-05-07T20:32:55.7060503Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.7060929Z self=, 2025-05-07T20:32:55.7061345Z T=128, 2025-05-07T20:32:55.7061543Z D=7168, 2025-05-07T20:32:55.7061755Z scale_ub=None, 2025-05-07T20:32:55.7061990Z contiguous=False, 2025-05-07T20:32:55.7062231Z compiled=True, 2025-05-07T20:32:55.7062449Z ) 2025-05-07T20:32:55.7619263Z self = 2025-05-07T20:32:55.7620432Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:55.7621115Z 2025-05-07T20:32:55.7621285Z @given( 2025-05-07T20:32:55.7621758Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.7622387Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.7623015Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.7623690Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.7624401Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.7624988Z ) 2025-05-07T20:32:55.7625509Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.7625960Z def test_silu_mul_quant( 2025-05-07T20:32:55.7626217Z self, 2025-05-07T20:32:55.7626419Z T: int, 2025-05-07T20:32:55.7626638Z D: int, 2025-05-07T20:32:55.7626866Z scale_ub: Optional[float], 2025-05-07T20:32:55.7627143Z contiguous: bool, 2025-05-07T20:32:55.7627398Z compiled: bool, 2025-05-07T20:32:55.7627637Z ) -> None: 2025-05-07T20:32:55.7627857Z torch.manual_seed(2025) 2025-05-07T20:32:55.7628112Z 2025-05-07T20:32:55.7628403Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.7628748Z 2025-05-07T20:32:55.7628951Z x_sign = torch.sign(x) 2025-05-07T20:32:55.7629252Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.7629565Z x = x_sign * x_clamp 2025-05-07T20:32:55.7629815Z x0 = x[:, :D] 2025-05-07T20:32:55.7630046Z x1 = x[:, D:] 2025-05-07T20:32:55.7630546Z 2025-05-07T20:32:55.7630738Z if contiguous: 2025-05-07T20:32:55.7630978Z x0 = x0.contiguous() 2025-05-07T20:32:55.7631248Z x1 = x1.contiguous() 2025-05-07T20:32:55.7631617Z 2025-05-07T20:32:55.7631818Z if scale_ub is not None: 2025-05-07T20:32:55.7632100Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.7632438Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.7632761Z ) 2025-05-07T20:32:55.7632963Z else: 2025-05-07T20:32:55.7633179Z scale_ub_tensor = None 2025-05-07T20:32:55.7633441Z 2025-05-07T20:32:55.7633687Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.7634006Z op = silu_mul_quant 2025-05-07T20:32:55.7634266Z if compiled: 2025-05-07T20:32:55.7634525Z op = torch.compile(op) 2025-05-07T20:32:55.7634826Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.7635119Z 2025-05-07T20:32:55.7635329Z y_fp8, y_scale = fn() 2025-05-07T20:32:55.7635625Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:55.7636033Z 2025-05-07T20:32:55.7636283Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.7636635Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:55.7636935Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:55.7637258Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:55.7637629Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.7637943Z 2025-05-07T20:32:55.7638158Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:55.7638356Z 2025-05-07T20:32:55.7638464Z moe/activation_test.py:126: 2025-05-07T20:32:55.7638768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.7639118Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:55.7639461Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.7640254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:55.7641014Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:55.7641573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.7642268Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.7642968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:55.7643691Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.7644431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:55.7645079Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:55.7645690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:55.7646219Z fn() 2025-05-07T20:32:55.7646734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:55.7647330Z self.fn.run( 2025-05-07T20:32:55.7647800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.7648341Z kernel = self.compile( 2025-05-07T20:32:55.7648893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.7649548Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.7649955Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.7650195Z 2025-05-07T20:32:55.7650534Z self = 2025-05-07T20:32:55.7651634Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.7653108Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b90281620>} 2025-05-07T20:32:55.7654452Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.7655490Z context = 2025-05-07T20:32:55.7655790Z 2025-05-07T20:32:55.7655964Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.7656503Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.7656980Z module_map=module_map) 2025-05-07T20:32:55.7657361Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.7657746Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.7658020Z E ^ 2025-05-07T20:32:55.7658495Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.7658955Z 2025-05-07T20:32:55.7659372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.7659886Z 2025-05-07T20:32:55.7660003Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.7660423Z self=, 2025-05-07T20:32:55.7660840Z T=128, 2025-05-07T20:32:55.7661045Z D=7168, 2025-05-07T20:32:55.7661249Z scale_ub=None, 2025-05-07T20:32:55.7661489Z contiguous=False, 2025-05-07T20:32:55.7661725Z compiled=False, 2025-05-07T20:32:55.7661936Z ) 2025-05-07T20:32:55.9637822Z self = 2025-05-07T20:32:55.9638555Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:55.9638865Z 2025-05-07T20:32:55.9638951Z @given( 2025-05-07T20:32:55.9639192Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.9639515Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.9639825Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.9640162Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.9640502Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.9640794Z ) 2025-05-07T20:32:55.9641153Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.9641605Z def test_silu_mul_quant( 2025-05-07T20:32:55.9641866Z self, 2025-05-07T20:32:55.9642071Z T: int, 2025-05-07T20:32:55.9642279Z D: int, 2025-05-07T20:32:55.9642498Z scale_ub: Optional[float], 2025-05-07T20:32:55.9642788Z contiguous: bool, 2025-05-07T20:32:55.9643050Z compiled: bool, 2025-05-07T20:32:55.9643282Z ) -> None: 2025-05-07T20:32:55.9643512Z torch.manual_seed(2025) 2025-05-07T20:32:55.9643765Z 2025-05-07T20:32:55.9644039Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.9644397Z 2025-05-07T20:32:55.9644599Z x_sign = torch.sign(x) 2025-05-07T20:32:55.9644902Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.9645212Z x = x_sign * x_clamp 2025-05-07T20:32:55.9645492Z x0 = x[:, :D] 2025-05-07T20:32:55.9645749Z x1 = x[:, D:] 2025-05-07T20:32:55.9645959Z 2025-05-07T20:32:55.9646156Z if contiguous: 2025-05-07T20:32:55.9646399Z x0 = x0.contiguous() 2025-05-07T20:32:55.9647018Z x1 = x1.contiguous() 2025-05-07T20:32:55.9647270Z 2025-05-07T20:32:55.9647474Z if scale_ub is not None: 2025-05-07T20:32:55.9647756Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.9648240Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.9648565Z ) 2025-05-07T20:32:55.9648763Z else: 2025-05-07T20:32:55.9648984Z scale_ub_tensor = None 2025-05-07T20:32:55.9649243Z 2025-05-07T20:32:55.9649480Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.9649805Z op = silu_mul_quant 2025-05-07T20:32:55.9650068Z if compiled: 2025-05-07T20:32:55.9650320Z op = torch.compile(op) 2025-05-07T20:32:55.9650617Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.9650902Z 2025-05-07T20:32:55.9651108Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.9651274Z 2025-05-07T20:32:55.9651378Z moe/activation_test.py:117: 2025-05-07T20:32:55.9651692Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.9652032Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.9652315Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.9653020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.9653720Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.9654265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.9654949Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.9655672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.9656219Z kernel = self.compile( 2025-05-07T20:32:55.9656766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.9657428Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.9657835Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.9658075Z 2025-05-07T20:32:55.9658290Z self = 2025-05-07T20:32:55.9659376Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.9660859Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b90282160>} 2025-05-07T20:32:55.9662297Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.9663339Z context = 2025-05-07T20:32:55.9663629Z 2025-05-07T20:32:55.9663811Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.9664338Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.9664822Z module_map=module_map) 2025-05-07T20:32:55.9665195Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.9665906Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.9666173Z E ^ 2025-05-07T20:32:55.9666639Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.9667090Z 2025-05-07T20:32:55.9667514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.9668209Z 2025-05-07T20:32:55.9668321Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.9668740Z self=, 2025-05-07T20:32:55.9669263Z T=4096, 2025-05-07T20:32:55.9669456Z D=5120, 2025-05-07T20:32:55.9669660Z scale_ub=1200.0, 2025-05-07T20:32:55.9669888Z contiguous=True, 2025-05-07T20:32:55.9670111Z compiled=False, 2025-05-07T20:32:55.9670326Z ) 2025-05-07T20:32:55.9670654Z self = 2025-05-07T20:32:55.9671155Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:55.9671431Z 2025-05-07T20:32:55.9671516Z @given( 2025-05-07T20:32:55.9671757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.9672077Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.9672385Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.9680607Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.9680966Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.9681260Z ) 2025-05-07T20:32:55.9681623Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.9682087Z def test_silu_mul_quant( 2025-05-07T20:32:55.9682335Z self, 2025-05-07T20:32:55.9682549Z T: int, 2025-05-07T20:32:55.9682757Z D: int, 2025-05-07T20:32:55.9682978Z scale_ub: Optional[float], 2025-05-07T20:32:55.9683260Z contiguous: bool, 2025-05-07T20:32:55.9683512Z compiled: bool, 2025-05-07T20:32:55.9683741Z ) -> None: 2025-05-07T20:32:55.9683971Z torch.manual_seed(2025) 2025-05-07T20:32:55.9684225Z 2025-05-07T20:32:55.9684507Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.9684868Z 2025-05-07T20:32:55.9685064Z x_sign = torch.sign(x) 2025-05-07T20:32:55.9685367Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.9685691Z x = x_sign * x_clamp 2025-05-07T20:32:55.9685941Z x0 = x[:, :D] 2025-05-07T20:32:55.9686161Z x1 = x[:, D:] 2025-05-07T20:32:55.9686380Z 2025-05-07T20:32:55.9686585Z if contiguous: 2025-05-07T20:32:55.9686824Z x0 = x0.contiguous() 2025-05-07T20:32:55.9687103Z x1 = x1.contiguous() 2025-05-07T20:32:55.9687356Z 2025-05-07T20:32:55.9687552Z if scale_ub is not None: 2025-05-07T20:32:55.9687834Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.9688179Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.9688492Z ) 2025-05-07T20:32:55.9688697Z else: 2025-05-07T20:32:55.9688910Z scale_ub_tensor = None 2025-05-07T20:32:55.9689166Z 2025-05-07T20:32:55.9689406Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.9689737Z op = silu_mul_quant 2025-05-07T20:32:55.9689990Z if compiled: 2025-05-07T20:32:55.9690253Z op = torch.compile(op) 2025-05-07T20:32:55.9690560Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.9690846Z 2025-05-07T20:32:55.9691047Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.9691228Z 2025-05-07T20:32:55.9691335Z moe/activation_test.py:117: 2025-05-07T20:32:55.9691641Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.9691978Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.9692270Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.9692979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.9693666Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.9694209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.9695012Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.9695694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.9696227Z kernel = self.compile( 2025-05-07T20:32:55.9696850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.9697514Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.9697922Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.9698153Z 2025-05-07T20:32:55.9698362Z self = 2025-05-07T20:32:55.9699449Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.9700837Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9204efc0>} 2025-05-07T20:32:55.9702182Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.9703212Z context = 2025-05-07T20:32:55.9703510Z 2025-05-07T20:32:55.9703679Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.9704214Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.9704689Z module_map=module_map) 2025-05-07T20:32:55.9705054Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.9705415Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.9705684Z E ^ 2025-05-07T20:32:55.9706152Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.9706616Z 2025-05-07T20:32:55.9707037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.9707556Z 2025-05-07T20:32:55.9707662Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.9708087Z self=, 2025-05-07T20:32:55.9708496Z T=1, 2025-05-07T20:32:55.9708691Z D=5120, 2025-05-07T20:32:55.9708898Z scale_ub=None, 2025-05-07T20:32:55.9709117Z contiguous=True, 2025-05-07T20:32:55.9709351Z compiled=True, 2025-05-07T20:32:55.9709569Z ) 2025-05-07T20:32:56.2101028Z W0507 20:32:56.206000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated [... CompilationError traceback identical to the [0/3] occurrence above ...] 2025-05-07T20:32:56.2136604Z W0507 20:32:56.206000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.2801506Z W0507 20:32:56.277000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated [... CompilationError traceback identical to the one above ...] 2025-05-07T20:32:56.2834235Z W0507 20:32:56.277000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:56.5799886Z self = 2025-05-07T20:32:56.5800619Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:56.5800967Z 2025-05-07T20:32:56.5801052Z @given( 2025-05-07T20:32:56.5801296Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:56.5801614Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:56.5801929Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:56.5802293Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:56.5802629Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:56.5802925Z ) 2025-05-07T20:32:56.5803298Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:56.5803746Z def test_silu_mul_quant( 2025-05-07T20:32:56.5803989Z self, 2025-05-07T20:32:56.5804194Z T: int, 2025-05-07T20:32:56.5804399Z D: int, 2025-05-07T20:32:56.5804616Z scale_ub: Optional[float], 2025-05-07T20:32:56.5804903Z contiguous: bool, 2025-05-07T20:32:56.5805150Z compiled: bool, 2025-05-07T20:32:56.5805381Z ) -> None: 2025-05-07T20:32:56.5805602Z torch.manual_seed(2025) 2025-05-07T20:32:56.5805850Z 2025-05-07T20:32:56.5806123Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:56.5806470Z 2025-05-07T20:32:56.5806668Z x_sign = torch.sign(x) 2025-05-07T20:32:56.5807269Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:56.5807593Z x = x_sign * x_clamp 2025-05-07T20:32:56.5807836Z x0 = x[:, :D] 2025-05-07T20:32:56.5808055Z x1 = x[:, D:] 2025-05-07T20:32:56.5808438Z 2025-05-07T20:32:56.5808628Z if contiguous: 2025-05-07T20:32:56.5808867Z x0 = x0.contiguous() 2025-05-07T20:32:56.5809123Z x1 = x1.contiguous() 2025-05-07T20:32:56.5809367Z 2025-05-07T20:32:56.5809566Z if scale_ub is not None: 2025-05-07T20:32:56.5809839Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:56.5810182Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:56.5810496Z ) 2025-05-07T20:32:56.5810686Z else: 2025-05-07T20:32:56.5810903Z scale_ub_tensor = None 2025-05-07T20:32:56.5811165Z 2025-05-07T20:32:56.5811398Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:56.5811718Z op = silu_mul_quant 2025-05-07T20:32:56.5811983Z if compiled: 2025-05-07T20:32:56.5812233Z op = torch.compile(op) 2025-05-07T20:32:56.5812534Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:56.5812820Z 2025-05-07T20:32:56.5813019Z y_fp8, y_scale = fn() 2025-05-07T20:32:56.5813306Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:56.5813606Z 2025-05-07T20:32:56.5813850Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:56.5814184Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:56.5814481Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:56.5814801Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:56.5815157Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:56.5815475Z 2025-05-07T20:32:56.5815693Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:56.5815920Z 2025-05-07T20:32:56.5816031Z moe/activation_test.py:126: 2025-05-07T20:32:56.5816337Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.5816681Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:56.5817011Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:56.5817802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 
2025-05-07T20:32:56.5818560Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:56.5819107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:56.5819795Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:56.5820478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:56.5821201Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:56.5821943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:56.5822581Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:56.5823186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:56.5823709Z fn() 2025-05-07T20:32:56.5824219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:56.5824794Z self.fn.run( 2025-05-07T20:32:56.5825265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:56.5825802Z kernel = self.compile( 2025-05-07T20:32:56.5826335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:56.5827160Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:56.5827569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.5827800Z 2025-05-07T20:32:56.5828014Z self = 2025-05-07T20:32:56.5829171Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:56.5830561Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b901cdee0>} 2025-05-07T20:32:56.5831903Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:56.5832936Z context = 2025-05-07T20:32:56.5833224Z 2025-05-07T20:32:56.5833399Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:56.5833919Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:56.5834401Z module_map=module_map) 2025-05-07T20:32:56.5834769Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:56.5835125Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:56.5835399Z E ^ 2025-05-07T20:32:56.5836036Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:56.5836505Z 2025-05-07T20:32:56.5836923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:56.5837434Z 2025-05-07T20:32:56.5837540Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:56.5837962Z self=, 2025-05-07T20:32:56.5838376Z T=2048, 2025-05-07T20:32:56.5838569Z D=5120, 2025-05-07T20:32:56.5838770Z scale_ub=None, 2025-05-07T20:32:56.5839002Z contiguous=True, 2025-05-07T20:32:56.5839225Z compiled=True, 2025-05-07T20:32:56.5839442Z ) 2025-05-07T20:32:56.8092553Z W0507 20:32:56.806000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated [... CompilationError traceback identical to the one above ...] 2025-05-07T20:32:56.8125352Z W0507 20:32:56.806000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.8797760Z W0507 20:32:56.876000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated [... CompilationError traceback identical to the one above ...] 2025-05-07T20:32:56.8830778Z W0507 20:32:56.876000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.1833531Z self = 2025-05-07T20:32:57.1834611Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:57.1835199Z 2025-05-07T20:32:57.1835364Z @given( 2025-05-07T20:32:57.1835911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.1836308Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.1836660Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.1836997Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.1837330Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.1837617Z ) 2025-05-07T20:32:57.1837973Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.1838431Z def test_silu_mul_quant( 2025-05-07T20:32:57.1838678Z self, 2025-05-07T20:32:57.1838888Z T: int, 2025-05-07T20:32:57.1839107Z D: int, 2025-05-07T20:32:57.1839328Z scale_ub: Optional[float], 2025-05-07T20:32:57.1839608Z contiguous: bool, 2025-05-07T20:32:57.1839866Z compiled: bool, 2025-05-07T20:32:57.1840101Z ) -> None: 2025-05-07T20:32:57.1840331Z torch.manual_seed(2025) 2025-05-07T20:32:57.1840587Z 2025-05-07T20:32:57.1840864Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.1841226Z 2025-05-07T20:32:57.1841428Z x_sign = torch.sign(x) 2025-05-07T20:32:57.1841727Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.1842042Z x = x_sign * x_clamp 2025-05-07T20:32:57.1842294Z x0 = x[:, :D] 2025-05-07T20:32:57.1842515Z x1 = x[:, D:] 2025-05-07T20:32:57.1842723Z 2025-05-07T20:32:57.1842914Z if contiguous: 2025-05-07T20:32:57.1843150Z x0 = x0.contiguous() 2025-05-07T20:32:57.1843412Z x1 = x1.contiguous() 2025-05-07T20:32:57.1843664Z 2025-05-07T20:32:57.1844197Z if scale_ub is not None: 2025-05-07T20:32:57.1844476Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.1844821Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.1845284Z ) 2025-05-07T20:32:57.1845484Z else: 2025-05-07T20:32:57.1845702Z scale_ub_tensor = None 2025-05-07T20:32:57.1845964Z 2025-05-07T20:32:57.1846194Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.1846515Z op = silu_mul_quant 2025-05-07T20:32:57.1846775Z if compiled: 2025-05-07T20:32:57.1847027Z op = torch.compile(op) 2025-05-07T20:32:57.1847328Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.1847610Z 2025-05-07T20:32:57.1847808Z y_fp8, y_scale = fn() 2025-05-07T20:32:57.1848092Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:57.1848390Z 2025-05-07T20:32:57.1848641Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.1848980Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:57.1849279Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:57.1849601Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:57.1849966Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:57.1850286Z 2025-05-07T20:32:57.1850496Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:57.1850694Z 2025-05-07T20:32:57.1850808Z moe/activation_test.py:126: 2025-05-07T20:32:57.1851110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.1851458Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:57.1851797Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:57.1852591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:32:57.1853357Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:57.1853929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.1862246Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.1862961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:57.1863703Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:57.1864441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:57.1865090Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:57.1865979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:57.1866558Z fn() 2025-05-07T20:32:57.1867084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:57.1867667Z self.fn.run( 2025-05-07T20:32:57.1868140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.1868683Z kernel = self.compile( 2025-05-07T20:32:57.1869224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.1869875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.1870291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.1870528Z 2025-05-07T20:32:57.1870745Z self = 2025-05-07T20:32:57.1872072Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.1873471Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b90a64ea0>} 2025-05-07T20:32:57.1874954Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.1876094Z context = 2025-05-07T20:32:57.1876410Z 2025-05-07T20:32:57.1876581Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.1877116Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.1877594Z module_map=module_map) 2025-05-07T20:32:57.1877980Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.1878343Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:57.1878618Z E ^ 2025-05-07T20:32:57.1879095Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.1879971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.1880602Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:57.1881028Z self=,
2025-05-07T20:32:57.1881441Z T=128,
2025-05-07T20:32:57.1881636Z D=5120,
2025-05-07T20:32:57.1881846Z scale_ub=None,
2025-05-07T20:32:57.1882070Z contiguous=True,
2025-05-07T20:32:57.1882298Z compiled=True,
2025-05-07T20:32:57.1882516Z )
2025-05-07T20:32:57.4291525Z W0507 20:32:57.426000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[two identical [0/6] warning tracebacks elided -- both end in triton.compiler.errors.CompilationError on _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:57.8430255Z self =
2025-05-07T20:32:57.8430884Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True
[test source and failure identical to the T = 2048 report above elided: ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row raises the same CompilationError]
2025-05-07T20:32:57.8467787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.8468416Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:57.8468832Z self=,
2025-05-07T20:32:57.8469240Z T=4096,
2025-05-07T20:32:57.8469436Z D=5120,
2025-05-07T20:32:57.8469635Z scale_ub=None,
2025-05-07T20:32:57.8469847Z contiguous=True,
2025-05-07T20:32:57.8470075Z compiled=True,
2025-05-07T20:32:57.8470290Z )
2025-05-07T20:32:58.0894696Z W0507 20:32:58.086000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[two identical [0/7] warning tracebacks elided -- same CompilationError as above]
2025-05-07T20:32:58.5083526Z self =
2025-05-07T20:32:58.5084298Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
[identical test source and traceback elided; triton_quantize_fp8_row again fails with CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:58.5120715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
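The "Trying example" blocks are Hypothesis's Verbosity.verbose output: each drawn parameter set is printed before it runs. While debugging, a single failing draw can be replayed deterministically by pinning it with @example next to @given. A sketch with a hypothetical test name, mirroring the strategies used above:

    from typing import Optional
    from hypothesis import Verbosity, example, given, settings
    import hypothesis.strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        scale_ub=st.sampled_from([None, 1200.00]),
    )
    @example(T=2048, scale_ub=None)  # the first failing draw in this log
    @settings(verbosity=Verbosity.verbose, max_examples=5, deadline=None)
    def test_silu_mul_quant_repro(T: int, scale_ub: Optional[float]) -> None:
        ...  # body would call silu_mul_quant as in the test above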
2025-05-07T20:32:58.5121336Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:58.5121754Z self=,
2025-05-07T20:32:58.5122157Z T=16384,
2025-05-07T20:32:58.5122361Z D=5120,
2025-05-07T20:32:58.5122563Z scale_ub=None,
2025-05-07T20:32:58.5122783Z contiguous=True,
2025-05-07T20:32:58.5123012Z compiled=True,
2025-05-07T20:32:58.5123236Z )
2025-05-07T20:32:58.5380417Z W0507 20:32:58.536000 96051 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:58.5381961Z W0507 20:32:58.536000 96051 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:58.5383312Z W0507 20:32:58.536000 96051 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:58.5384306Z W0507 20:32:58.536000 96051 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:58.5385425Z W0507 20:32:58.536000 96051 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
2025-05-07T20:32:58.6270602Z self =
2025-05-07T20:32:58.6278669Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
[identical test source and triton_quantize_fp8_row traceback elided; same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:58.6315276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.6316000Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:58.6316420Z self=,
2025-05-07T20:32:58.6316834Z T=1,
2025-05-07T20:32:58.6317034Z D=5120,
2025-05-07T20:32:58.6317232Z scale_ub=1200.0,
2025-05-07T20:32:58.6317469Z contiguous=True,
2025-05-07T20:32:58.6317706Z compiled=True,
2025-05-07T20:32:58.6317921Z )
2025-05-07T20:32:58.7707069Z self =
2025-05-07T20:32:58.7707883Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
[identical test source elided; this example fails earlier, in the compiled fn() path:]
2025-05-07T20:32:58.7720682Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:58.7720958Z moe/activation_test.py:117:
2025-05-07T20:32:58.7721256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:58.7721595Z moe/activation_test.py:115: in fn
2025-05-07T20:32:58.7721879Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:58.7722450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:58.7723008Z     return fn(*args, **kwargs)
2025-05-07T20:32:58.7723668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:58.7724361Z     _fbgemm_silu_mul_quant[grid](
[jit/compile frames and CUDAOptions dump identical to the traceback above elided]
2025-05-07T20:32:58.7735563Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.7735925Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:58.7736193Z E   ^
2025-05-07T20:32:58.7736665Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.7737537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
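The recompile_limit warning above is a separate issue from the fp8 failure: x0 = x[:, :D] is a view whose row stride stays 2*D, so torch.compile's guard on stride(0) flips between the contiguous and non-contiguous draws until the limit of 8 is hit and Dynamo falls back to eager. A quick illustration of the two strides named in the warning (CPU is fine; no GPU needed):

    import torch

    T, D = 1, 5120
    x = torch.randn([T, 2 * D], dtype=torch.bfloat16)
    x0 = x[:, :D]                      # view into the wide [T, 2*D] buffer
    print(x0.stride())                 # (10240, 1) -> "actual 10240"
    print(x0.contiguous().stride())    # (5120, 1)  -> "expected 5120"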
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.7737117Z 2025-05-07T20:32:58.7737537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.7738049Z 2025-05-07T20:32:58.7738244Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.7738735Z self=, 2025-05-07T20:32:58.7739146Z T=1, 2025-05-07T20:32:58.7739337Z D=5120, 2025-05-07T20:32:58.7739579Z scale_ub=None, 2025-05-07T20:32:58.7739803Z contiguous=False, 2025-05-07T20:32:58.7740033Z compiled=True, 2025-05-07T20:32:58.7740243Z ) 2025-05-07T20:32:58.8364384Z self = 2025-05-07T20:32:58.8366196Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:58.8366863Z 2025-05-07T20:32:58.8366990Z @given( 2025-05-07T20:32:58.8367282Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.8367608Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.8367928Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.8368262Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.8368615Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.8368914Z ) 2025-05-07T20:32:58.8369268Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.8369727Z def test_silu_mul_quant( 2025-05-07T20:32:58.8369976Z self, 2025-05-07T20:32:58.8370171Z T: int, 2025-05-07T20:32:58.8370375Z D: int, 2025-05-07T20:32:58.8370597Z scale_ub: Optional[float], 2025-05-07T20:32:58.8370875Z contiguous: bool, 2025-05-07T20:32:58.8371119Z compiled: bool, 2025-05-07T20:32:58.8371352Z ) -> None: 2025-05-07T20:32:58.8371575Z torch.manual_seed(2025) 2025-05-07T20:32:58.8371819Z 2025-05-07T20:32:58.8372100Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.8372448Z 2025-05-07T20:32:58.8372648Z x_sign = torch.sign(x) 2025-05-07T20:32:58.8372944Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.8373261Z x = x_sign * x_clamp 2025-05-07T20:32:58.8373508Z x0 = x[:, :D] 2025-05-07T20:32:58.8373730Z x1 = x[:, D:] 2025-05-07T20:32:58.8373943Z 2025-05-07T20:32:58.8374131Z if contiguous: 2025-05-07T20:32:58.8374370Z x0 = x0.contiguous() 2025-05-07T20:32:58.8374635Z x1 = x1.contiguous() 2025-05-07T20:32:58.8374877Z 2025-05-07T20:32:58.8375079Z if scale_ub is not None: 2025-05-07T20:32:58.8375360Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.8375696Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.8376015Z ) 2025-05-07T20:32:58.8376215Z else: 2025-05-07T20:32:58.8376431Z scale_ub_tensor = None 2025-05-07T20:32:58.8376686Z 2025-05-07T20:32:58.8376926Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.8377253Z op = silu_mul_quant 2025-05-07T20:32:58.8377504Z if compiled: 2025-05-07T20:32:58.8377763Z op = torch.compile(op) 2025-05-07T20:32:58.8378072Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.8378349Z 2025-05-07T20:32:58.8378550Z y_fp8, y_scale = fn() 2025-05-07T20:32:58.8378844Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:58.8379138Z 2025-05-07T20:32:58.8379387Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.8379733Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:58.8380030Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:58.8380351Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:58.8380720Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.8381043Z 2025-05-07T20:32:58.8381248Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:58.8381455Z 2025-05-07T20:32:58.8381561Z moe/activation_test.py:126: 2025-05-07T20:32:58.8381868Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.8382560Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:58.8382897Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.8383689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:58.8384525Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:58.8385070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.8385766Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.8386461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:58.8387232Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.8387979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:58.8388626Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:58.8389230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:58.8389750Z fn() 2025-05-07T20:32:58.8390261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:58.8390853Z self.fn.run( 2025-05-07T20:32:58.8391318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.8391856Z kernel = self.compile( 2025-05-07T20:32:58.8392399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.8393058Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.8393458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.8393702Z 2025-05-07T20:32:58.8393911Z self = 2025-05-07T20:32:58.8395007Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.8396526Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66bbb2e0>} 2025-05-07T20:32:58.8397928Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.8398953Z context = 2025-05-07T20:32:58.8399248Z 2025-05-07T20:32:58.8399426Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.8399964Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.8400437Z module_map=module_map) 2025-05-07T20:32:58.8400812Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.8401180Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:58.8401461Z E ^ 2025-05-07T20:32:58.8401929Z E ValueError("type fp8e4nv not supported in this architecture. 
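[Both kernels die the same way: Triton rejects the fp8e4nv (FP8 E4M3) element type while lowering the kernel AST, before anything launches on the GPU. fp8e4nv codegen requires an NVIDIA GPU with compute capability 8.9 or newer (Ada/Hopper); the linux.g5.4xlarge.nvidia.gpu runner appears to carry an A10G, which reports (8, 6), so every Hypothesis example hits the identical compile-time error. A minimal sketch of a capability guard that would skip these cases on unsupported hardware; the helper name and the decorator placement are illustrative, not FBGEMM's actual test scaffolding:]

```python
import unittest

import torch


def supports_fp8e4nv() -> bool:
    """True if Triton should be able to emit fp8e4nv for the current GPU.

    Assumption: fp8e4nv (E4M3) lowering needs NVIDIA compute capability
    >= (8, 9) (Ada/Hopper); the A10G on a g5 runner reports (8, 6).
    """
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical placement; the real class in moe/activation_test.py may differ.
@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
class SiluMulQuantTest(unittest.TestCase):
    ...
```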
[Hypothesis tries further examples; every one fails with the identical fp8e4nv CompilationError ending at triton/compiler/compiler.py:100. The repeated source listings and tracebacks are omitted; each example and its failing call site:]
2025-05-07T20:32:58.8403436Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -> _fbgemm_silu_mul_quant via fn() (moe/activation_test.py:117)
2025-05-07T20:32:58.9960022Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -> _fbgemm_silu_mul_quant via fn()
2025-05-07T20:32:59.0005272Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant via fn()
2025-05-07T20:32:59.1195102Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant via fn()
2025-05-07T20:32:59.1226500Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> _fbgemm_silu_mul_quant via fn()
2025-05-07T20:32:59.3028829Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> _fbgemm_silu_mul_quant via fn()
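[Note that the CompilationError is raised at 1:0, i.e. at the kernel's def line during make_ir: the failure is a property of the target architecture, not of the inputs, which is why T, D, scale_ub, contiguous, and compiled have no effect on the outcome. A standalone repro sketch, assuming current Triton and PyTorch fp8 APIs (tl.float8e4nv, torch.float8_e4m3fn); on a pre-SM-8.9 GPU compiling this kernel should raise the same ValueError:]

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # This cast is what fp8e4nv lowering must support; on SM < 8.9 Triton
    # raises ValueError("type fp8e4nv not supported in this architecture. ...").
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


n = 1024
x = torch.randn(n, device="cuda", dtype=torch.float32)
y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
_cast_to_fp8e4nv[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)
```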
2025-05-07T20:32:59.3070688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.3071916Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.3072855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.3074048Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.3075513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.3076636Z kernel = self.compile( 2025-05-07T20:32:59.3077639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.3078903Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.3079591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.3079991Z 2025-05-07T20:32:59.3080347Z self = 2025-05-07T20:32:59.3082258Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.3084714Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69ef4360>} 2025-05-07T20:32:59.3087116Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.3088943Z context = 2025-05-07T20:32:59.3089443Z 2025-05-07T20:32:59.3089730Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.3090629Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.3091440Z module_map=module_map) 2025-05-07T20:32:59.3092056Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.3092646Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.3093085Z E ^ 2025-05-07T20:32:59.3093892Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.3094710Z 2025-05-07T20:32:59.3095449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.3096360Z 2025-05-07T20:32:59.3096537Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.3097256Z self=, 2025-05-07T20:32:59.3098007Z T=1, 2025-05-07T20:32:59.3098312Z D=7168, 2025-05-07T20:32:59.3098640Z scale_ub=1200.0, 2025-05-07T20:32:59.3099016Z contiguous=False, 2025-05-07T20:32:59.3099396Z compiled=True, 2025-05-07T20:32:59.3099731Z ) 2025-05-07T20:32:59.4430425Z self = 2025-05-07T20:32:59.4431294Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:59.4431734Z 2025-05-07T20:32:59.4431868Z @given( 2025-05-07T20:32:59.4432227Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.4432772Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.4433270Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.4433800Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.4434341Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.4434804Z ) 2025-05-07T20:32:59.4435375Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.4436224Z def test_silu_mul_quant( 2025-05-07T20:32:59.4436613Z self, 2025-05-07T20:32:59.4436909Z T: int, 2025-05-07T20:32:59.4437204Z D: int, 2025-05-07T20:32:59.4437597Z scale_ub: Optional[float], 2025-05-07T20:32:59.4438029Z contiguous: bool, 2025-05-07T20:32:59.4438401Z compiled: bool, 2025-05-07T20:32:59.4438759Z ) -> None: 2025-05-07T20:32:59.4439100Z torch.manual_seed(2025) 2025-05-07T20:32:59.4439481Z 2025-05-07T20:32:59.4440333Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.4440996Z 2025-05-07T20:32:59.4441309Z x_sign = torch.sign(x) 2025-05-07T20:32:59.4441794Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.4442417Z x = x_sign * x_clamp 2025-05-07T20:32:59.4442793Z x0 = x[:, :D] 2025-05-07T20:32:59.4443134Z x1 = x[:, D:] 2025-05-07T20:32:59.4443472Z 2025-05-07T20:32:59.4443770Z if contiguous: 2025-05-07T20:32:59.4444139Z x0 = x0.contiguous() 2025-05-07T20:32:59.4444568Z x1 = x1.contiguous() 2025-05-07T20:32:59.4444956Z 2025-05-07T20:32:59.4445255Z if scale_ub is not None: 2025-05-07T20:32:59.4445703Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.4446242Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.4446735Z ) 2025-05-07T20:32:59.4447039Z else: 2025-05-07T20:32:59.4447377Z scale_ub_tensor = None 2025-05-07T20:32:59.4447777Z 2025-05-07T20:32:59.4448159Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.4448675Z op = silu_mul_quant 2025-05-07T20:32:59.4449073Z if compiled: 2025-05-07T20:32:59.4449469Z op = torch.compile(op) 2025-05-07T20:32:59.4449948Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.4450389Z 2025-05-07T20:32:59.4450698Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.4450974Z 2025-05-07T20:32:59.4451136Z moe/activation_test.py:117: 2025-05-07T20:32:59.4451615Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.4452129Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.4452579Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.4453504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:59.4454427Z return fn(*args, **kwargs) 
2025-05-07T20:32:59.4455348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.4456375Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.4457216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.4458318Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.4459372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.4460211Z kernel = self.compile( 2025-05-07T20:32:59.4461123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.4462253Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.4462946Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.4463349Z 2025-05-07T20:32:59.4463718Z self = 2025-05-07T20:32:59.4465919Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.4468252Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69ef59e0>} 2025-05-07T20:32:59.4470475Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.4472139Z context = 2025-05-07T20:32:59.4472621Z 2025-05-07T20:32:59.4472883Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.4473991Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.4474746Z module_map=module_map) 2025-05-07T20:32:59.4475429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.4476083Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.4476503Z E ^ 2025-05-07T20:32:59.4477248Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.4477972Z 2025-05-07T20:32:59.4478654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.4479521Z 2025-05-07T20:32:59.4479685Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.4480327Z self=, 2025-05-07T20:32:59.4480882Z T=1, 2025-05-07T20:32:59.4481184Z D=7168, 2025-05-07T20:32:59.4481518Z scale_ub=None, 2025-05-07T20:32:59.4481875Z contiguous=False, 2025-05-07T20:32:59.4482252Z compiled=True, 2025-05-07T20:32:59.4482596Z ) 2025-05-07T20:32:59.7144578Z self = 2025-05-07T20:32:59.7145473Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:59.7145912Z 2025-05-07T20:32:59.7146036Z @given( 2025-05-07T20:32:59.7146400Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.7146916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.7147398Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.7147939Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.7148466Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.7148893Z ) 2025-05-07T20:32:59.7149445Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.7150175Z def test_silu_mul_quant( 2025-05-07T20:32:59.7150558Z self, 2025-05-07T20:32:59.7150870Z T: int, 2025-05-07T20:32:59.7151179Z D: int, 2025-05-07T20:32:59.7151526Z scale_ub: Optional[float], 2025-05-07T20:32:59.7151952Z contiguous: bool, 2025-05-07T20:32:59.7152329Z compiled: bool, 2025-05-07T20:32:59.7152685Z ) -> None: 2025-05-07T20:32:59.7153022Z torch.manual_seed(2025) 2025-05-07T20:32:59.7153417Z 2025-05-07T20:32:59.7153855Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.7154409Z 2025-05-07T20:32:59.7154721Z x_sign = torch.sign(x) 2025-05-07T20:32:59.7155185Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.7155690Z x = x_sign * x_clamp 2025-05-07T20:32:59.7156169Z x0 = x[:, :D] 2025-05-07T20:32:59.7156528Z x1 = x[:, D:] 2025-05-07T20:32:59.7156854Z 2025-05-07T20:32:59.7157148Z if contiguous: 2025-05-07T20:32:59.7157567Z x0 = x0.contiguous() 2025-05-07T20:32:59.7157997Z x1 = x1.contiguous() 2025-05-07T20:32:59.7158387Z 2025-05-07T20:32:59.7158691Z if scale_ub is not None: 2025-05-07T20:32:59.7159115Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.7159656Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.7160156Z ) 2025-05-07T20:32:59.7160460Z else: 2025-05-07T20:32:59.7160786Z scale_ub_tensor = None 2025-05-07T20:32:59.7161183Z 2025-05-07T20:32:59.7161546Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.7162043Z op = silu_mul_quant 2025-05-07T20:32:59.7162427Z if compiled: 2025-05-07T20:32:59.7162826Z op = torch.compile(op) 2025-05-07T20:32:59.7163296Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.7163733Z 2025-05-07T20:32:59.7164051Z y_fp8, y_scale = fn() 2025-05-07T20:32:59.7164961Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:59.7165848Z 2025-05-07T20:32:59.7166200Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.7166672Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:59.7167215Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:59.7167684Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:59.7168217Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.7168691Z 2025-05-07T20:32:59.7169006Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:59.7169321Z 2025-05-07T20:32:59.7169488Z moe/activation_test.py:126: 2025-05-07T20:32:59.7169953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.7170441Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:59.7170931Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.7172171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:59.7173393Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:59.7174246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.7175336Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.7176494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:59.7177783Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:59.7179078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:59.7180191Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:59.7181243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:59.7182113Z fn() 2025-05-07T20:32:59.7182922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:59.7183835Z self.fn.run( 2025-05-07T20:32:59.7184635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.7185482Z kernel = self.compile( 2025-05-07T20:32:59.7186347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.7187403Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.7188051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.7188432Z 2025-05-07T20:32:59.7188771Z self = 2025-05-07T20:32:59.7190577Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.7192831Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69ef6700>} 2025-05-07T20:32:59.7194992Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.7196789Z context = 2025-05-07T20:32:59.7197299Z 2025-05-07T20:32:59.7197586Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.7198493Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.7199508Z module_map=module_map) 2025-05-07T20:32:59.7200210Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.7200818Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:59.7201332Z E ^ 2025-05-07T20:32:59.7202140Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.7202941Z 2025-05-07T20:32:59.7203686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.7204603Z 2025-05-07T20:32:59.7204782Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.7205489Z self=, 2025-05-07T20:32:59.7206188Z T=1, 2025-05-07T20:32:59.7206503Z D=5120, 2025-05-07T20:32:59.7206812Z scale_ub=1200.0, 2025-05-07T20:32:59.7207175Z contiguous=False, 2025-05-07T20:32:59.7207540Z compiled=True, 2025-05-07T20:32:59.7207872Z ) 2025-05-07T20:32:59.8749987Z self = 2025-05-07T20:32:59.8750879Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:59.8751335Z 2025-05-07T20:32:59.8751460Z @given( 2025-05-07T20:32:59.8751833Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.8752336Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.8752836Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.8753366Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.8753904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.8754345Z ) 2025-05-07T20:32:59.8754901Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.8755620Z def test_silu_mul_quant( 2025-05-07T20:32:59.8756090Z self, 2025-05-07T20:32:59.8756397Z T: int, 2025-05-07T20:32:59.8756705Z D: int, 2025-05-07T20:32:59.8757058Z scale_ub: Optional[float], 2025-05-07T20:32:59.8757500Z contiguous: bool, 2025-05-07T20:32:59.8757884Z compiled: bool, 2025-05-07T20:32:59.8758245Z ) -> None: 2025-05-07T20:32:59.8758598Z torch.manual_seed(2025) 2025-05-07T20:32:59.8759007Z 2025-05-07T20:32:59.8759443Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.8760001Z 2025-05-07T20:32:59.8760313Z x_sign = torch.sign(x) 2025-05-07T20:32:59.8760776Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.8761283Z x = x_sign * x_clamp 2025-05-07T20:32:59.8761686Z x0 = x[:, :D] 2025-05-07T20:32:59.8762039Z x1 = x[:, D:] 2025-05-07T20:32:59.8762366Z 2025-05-07T20:32:59.8762670Z if contiguous: 2025-05-07T20:32:59.8763047Z x0 = x0.contiguous() 2025-05-07T20:32:59.8763453Z x1 = x1.contiguous() 2025-05-07T20:32:59.8763843Z 2025-05-07T20:32:59.8764160Z if scale_ub is not None: 2025-05-07T20:32:59.8764595Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.8765137Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.8765981Z ) 2025-05-07T20:32:59.8766282Z else: 2025-05-07T20:32:59.8766620Z scale_ub_tensor = None 2025-05-07T20:32:59.8767023Z 2025-05-07T20:32:59.8767393Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.8767926Z op = silu_mul_quant 2025-05-07T20:32:59.8768322Z if compiled: 2025-05-07T20:32:59.8768717Z op = torch.compile(op) 2025-05-07T20:32:59.8769193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.8769626Z 2025-05-07T20:32:59.8769944Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.8770199Z 2025-05-07T20:32:59.8770357Z moe/activation_test.py:117: 2025-05-07T20:32:59.8770763Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.8771766Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.8772208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.8773096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:59.8774087Z return fn(*args, **kwargs) 
2025-05-07T20:32:59.8775125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.8776224Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.8777150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.8778373Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.8779528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.8780455Z kernel = self.compile( 2025-05-07T20:32:59.8781416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.8782493Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.8783128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.8783473Z 2025-05-07T20:32:59.8783813Z self = 2025-05-07T20:32:59.8785575Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.8787830Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69ef7e20>} 2025-05-07T20:32:59.8790025Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.8791739Z context = 2025-05-07T20:32:59.8792227Z 2025-05-07T20:32:59.8792483Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.8793321Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.8794130Z module_map=module_map) 2025-05-07T20:32:59.8794684Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.8795226Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.8795590Z E ^ 2025-05-07T20:32:59.8796447Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:59.8798992Z 
2025-05-07T20:32:59.8799163Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:59.8799870Z     self=<...>,
2025-05-07T20:32:59.8800578Z     T=1,
2025-05-07T20:32:59.8800871Z     D=5120,
2025-05-07T20:32:59.8801195Z     scale_ub=1200.0,
2025-05-07T20:32:59.8801569Z     contiguous=False,
2025-05-07T20:32:59.8801936Z     compiled=False,
2025-05-07T20:32:59.8802287Z )
2025-05-07T20:32:59.8802829Z self = <...>
2025-05-07T20:32:59.8803672Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:59.8804148Z 
2025-05-07T20:32:59.8804277Z     @given(
2025-05-07T20:32:59.8804657Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:59.8805179Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:59.8805858Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:59.8806494Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:59.8807054Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:59.8807571Z     )
2025-05-07T20:32:59.8808191Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:59.8808906Z     def test_silu_mul_quant(
2025-05-07T20:32:59.8809281Z         self,
2025-05-07T20:32:59.8809598Z         T: int,
2025-05-07T20:32:59.8809913Z         D: int,
2025-05-07T20:32:59.8810256Z         scale_ub: Optional[float],
2025-05-07T20:32:59.8810714Z         contiguous: bool,
2025-05-07T20:32:59.8811112Z         compiled: bool,
2025-05-07T20:32:59.8811476Z     ) -> None:
2025-05-07T20:32:59.8811835Z         torch.manual_seed(2025)
2025-05-07T20:32:59.8812243Z 
2025-05-07T20:32:59.8812690Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:59.8813279Z 
2025-05-07T20:32:59.8813609Z         x_sign = torch.sign(x)
2025-05-07T20:32:59.8814094Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:59.8814615Z         x = x_sign * x_clamp
2025-05-07T20:32:59.8815015Z         x0 = x[:, :D]
2025-05-07T20:32:59.8815379Z         x1 = x[:, D:]
2025-05-07T20:32:59.8815733Z 
2025-05-07T20:32:59.8816035Z         if contiguous:
2025-05-07T20:32:59.8816411Z             x0 = x0.contiguous()
2025-05-07T20:32:59.8816850Z             x1 = x1.contiguous()
2025-05-07T20:32:59.8817261Z 
2025-05-07T20:32:59.8817578Z         if scale_ub is not None:
2025-05-07T20:32:59.8818045Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:59.8818614Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:59.8819145Z             )
2025-05-07T20:32:59.8819460Z         else:
2025-05-07T20:32:59.8819813Z             scale_ub_tensor = None
2025-05-07T20:32:59.8820239Z 
2025-05-07T20:32:59.8820620Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:59.8821167Z             op = silu_mul_quant
2025-05-07T20:32:59.8821580Z             if compiled:
2025-05-07T20:32:59.8821985Z                 op = torch.compile(op)
2025-05-07T20:32:59.8822483Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:59.8822953Z 
2025-05-07T20:32:59.8823269Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:59.8823554Z 
2025-05-07T20:32:59.8823719Z moe/activation_test.py:117: 
2025-05-07T20:32:59.8836420Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:59.8836980Z moe/activation_test.py:115: in fn
2025-05-07T20:32:59.8837434Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:59.8838593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:59.8839707Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:59.8840563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:59.8841649Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:59.8842677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:59.8843480Z     kernel = self.compile(
2025-05-07T20:32:59.8844333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:59.8845352Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:59.8845966Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:59.8846335Z 
2025-05-07T20:32:59.8846672Z self = <...>
2025-05-07T20:32:59.8848505Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:59.8850799Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f8b66666480>}
2025-05-07T20:32:59.8853218Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:59.8855045Z context = <...>
2025-05-07T20:32:59.8855540Z 
2025-05-07T20:32:59.8855836Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:59.8856740Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:59.8857599Z                            module_map=module_map)
2025-05-07T20:32:59.8858241Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:59.8858858Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:59.8859306Z E       ^
2025-05-07T20:32:59.8860116Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:59.8860921Z 
2025-05-07T20:32:59.8861670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
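[Triage note: every failure in this run is the same underlying issue. The _fbgemm_silu_mul_quant Triton kernel produces an fp8e4nv (float8 e4m3) value, and Triton only supports that dtype on NVIDIA GPUs with compute capability >= 8.9 (Ada/Hopper); on older architectures only fp8e4b15 and fp8e5 are available, so the kernel fails at compile time before any example can run, independent of T, D, scale_ub, contiguous, or compiled. A minimal sketch of a capability guard that would turn these hard failures into skips; the helper and class names are hypothetical, not FBGEMM's actual skip logic:]

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Triton emits fp8e4nv (e4m3) only for compute capability >= (8, 9).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical guard around the fp8 activation tests.
@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
class Fp8ActivationTest(unittest.TestCase):
    ...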
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.8860921Z 2025-05-07T20:32:59.8861670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.8862582Z 2025-05-07T20:32:59.8862756Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.8863477Z self=, 2025-05-07T20:32:59.8864179Z T=16384, 2025-05-07T20:32:59.8864494Z D=5120, 2025-05-07T20:32:59.8864823Z scale_ub=1200.0, 2025-05-07T20:32:59.8865206Z contiguous=False, 2025-05-07T20:32:59.8865929Z compiled=True, 2025-05-07T20:32:59.8866274Z ) 2025-05-07T20:32:59.9730900Z self = 2025-05-07T20:32:59.9731887Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:59.9732389Z 2025-05-07T20:32:59.9732521Z @given( 2025-05-07T20:32:59.9732911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.9733448Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.9733957Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.9734521Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.9735076Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.9735568Z ) 2025-05-07T20:32:59.9736171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.9736928Z def test_silu_mul_quant( 2025-05-07T20:32:59.9737340Z self, 2025-05-07T20:32:59.9737667Z T: int, 2025-05-07T20:32:59.9737990Z D: int, 2025-05-07T20:32:59.9738358Z scale_ub: Optional[float], 2025-05-07T20:32:59.9738826Z contiguous: bool, 2025-05-07T20:32:59.9739244Z compiled: bool, 2025-05-07T20:32:59.9739624Z ) -> None: 2025-05-07T20:32:59.9739992Z torch.manual_seed(2025) 2025-05-07T20:32:59.9740409Z 2025-05-07T20:32:59.9740858Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.9741453Z 2025-05-07T20:32:59.9741773Z x_sign = torch.sign(x) 2025-05-07T20:32:59.9742231Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.9742729Z x = x_sign * x_clamp 2025-05-07T20:32:59.9743117Z x0 = x[:, :D] 2025-05-07T20:32:59.9743463Z x1 = x[:, D:] 2025-05-07T20:32:59.9743793Z 2025-05-07T20:32:59.9744094Z if contiguous: 2025-05-07T20:32:59.9744466Z x0 = x0.contiguous() 2025-05-07T20:32:59.9744907Z x1 = x1.contiguous() 2025-05-07T20:32:59.9745324Z 2025-05-07T20:32:59.9745635Z if scale_ub is not None: 2025-05-07T20:32:59.9746095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.9747199Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.9747733Z ) 2025-05-07T20:32:59.9748058Z else: 2025-05-07T20:32:59.9748409Z scale_ub_tensor = None 2025-05-07T20:32:59.9748937Z 2025-05-07T20:32:59.9749316Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.9749858Z op = silu_mul_quant 2025-05-07T20:32:59.9750271Z if compiled: 2025-05-07T20:32:59.9750672Z op = torch.compile(op) 2025-05-07T20:32:59.9751176Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.9751648Z 2025-05-07T20:32:59.9751961Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.9752253Z 2025-05-07T20:32:59.9752418Z moe/activation_test.py:117: 2025-05-07T20:32:59.9752925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.9753486Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.9753972Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.9754963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:59.9756076Z return fn(*args, **kwargs) 
2025-05-07T20:32:59.9757237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.9758514Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.9759462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.9760652Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.9761824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.9762773Z kernel = self.compile( 2025-05-07T20:32:59.9763640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.9764512Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.9765081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.9765719Z 2025-05-07T20:32:59.9766043Z self = 2025-05-07T20:32:59.9767678Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.9769911Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66667ce0>} 2025-05-07T20:32:59.9772176Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.9773943Z context = 2025-05-07T20:32:59.9774447Z 2025-05-07T20:32:59.9774737Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.9775617Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.9776368Z module_map=module_map) 2025-05-07T20:32:59.9776933Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.9777476Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.9777879Z E ^ 2025-05-07T20:32:59.9778624Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.9779353Z 2025-05-07T20:32:59.9779995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.9780812Z 2025-05-07T20:32:59.9781318Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.9781975Z self=, 2025-05-07T20:32:59.9782619Z T=2048, 2025-05-07T20:32:59.9783006Z D=7168, 2025-05-07T20:32:59.9783279Z scale_ub=1200.0, 2025-05-07T20:32:59.9783609Z contiguous=False, 2025-05-07T20:32:59.9783949Z compiled=True, 2025-05-07T20:32:59.9784264Z ) 2025-05-07T20:32:59.9784749Z self = 2025-05-07T20:32:59.9785506Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:59.9785936Z 2025-05-07T20:32:59.9786054Z @given( 2025-05-07T20:32:59.9786403Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.9786876Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.9787341Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.9787889Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.9788415Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.9788858Z ) 2025-05-07T20:32:59.9789389Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.9790077Z def test_silu_mul_quant( 2025-05-07T20:32:59.9790444Z self, 2025-05-07T20:32:59.9790746Z T: int, 2025-05-07T20:32:59.9791049Z D: int, 2025-05-07T20:32:59.9791374Z scale_ub: Optional[float], 2025-05-07T20:32:59.9791797Z contiguous: bool, 2025-05-07T20:32:59.9792173Z compiled: bool, 2025-05-07T20:32:59.9792501Z ) -> None: 2025-05-07T20:32:59.9792830Z torch.manual_seed(2025) 2025-05-07T20:32:59.9793202Z 2025-05-07T20:32:59.9793609Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.9794130Z 2025-05-07T20:32:59.9794424Z x_sign = torch.sign(x) 2025-05-07T20:32:59.9794872Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.9795352Z x = x_sign * x_clamp 2025-05-07T20:32:59.9795716Z x0 = x[:, :D] 2025-05-07T20:32:59.9796150Z x1 = x[:, D:] 2025-05-07T20:32:59.9796454Z 2025-05-07T20:32:59.9796732Z if contiguous: 2025-05-07T20:32:59.9797090Z x0 = x0.contiguous() 2025-05-07T20:32:59.9797472Z x1 = x1.contiguous() 2025-05-07T20:32:59.9797870Z 2025-05-07T20:32:59.9798182Z if scale_ub is not None: 2025-05-07T20:32:59.9798598Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.9799112Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.9799592Z ) 2025-05-07T20:32:59.9799883Z else: 2025-05-07T20:32:59.9800206Z scale_ub_tensor = None 2025-05-07T20:32:59.9800600Z 2025-05-07T20:32:59.9800947Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.9801454Z op = silu_mul_quant 2025-05-07T20:32:59.9801856Z if compiled: 2025-05-07T20:32:59.9802268Z op = torch.compile(op) 2025-05-07T20:32:59.9802767Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.9803229Z 2025-05-07T20:32:59.9803552Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.9803837Z 2025-05-07T20:32:59.9804001Z moe/activation_test.py:117: 2025-05-07T20:32:59.9804491Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.9805049Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.9805507Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.9806457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:59.9807454Z return fn(*args, **kwargs) 
2025-05-07T20:32:59.9808544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.9809664Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.9810750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.9812049Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.9813265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.9814195Z kernel = self.compile( 2025-05-07T20:32:59.9815141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.9816277Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.9816950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.9817355Z 2025-05-07T20:32:59.9817745Z self = 2025-05-07T20:32:59.9819662Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.9822123Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b666679c0>} 2025-05-07T20:32:59.9824506Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.9826309Z context = 2025-05-07T20:32:59.9826815Z 2025-05-07T20:32:59.9827092Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.9828050Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.9828854Z module_map=module_map) 2025-05-07T20:32:59.9829480Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.9830077Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.9830515Z E ^ 2025-05-07T20:32:59.9831316Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.9832130Z 2025-05-07T20:32:59.9832857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.9833768Z 2025-05-07T20:33:00.1003656Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.1004459Z self=, 2025-05-07T20:33:00.1005159Z T=1, 2025-05-07T20:33:00.1005475Z D=5120, 2025-05-07T20:33:00.1005802Z scale_ub=None, 2025-05-07T20:33:00.1006155Z contiguous=False, 2025-05-07T20:33:00.1006532Z compiled=False, 2025-05-07T20:33:00.1006878Z ) 2025-05-07T20:33:00.1007444Z self = 2025-05-07T20:33:00.1008287Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:00.1008755Z 2025-05-07T20:33:00.1008883Z @given( 2025-05-07T20:33:00.1009281Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.1009804Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.1010335Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.1010906Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.1011465Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.1011960Z ) 2025-05-07T20:33:00.1012570Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.1013345Z def test_silu_mul_quant( 2025-05-07T20:33:00.1013748Z self, 2025-05-07T20:33:00.1014076Z T: int, 2025-05-07T20:33:00.1014404Z D: int, 2025-05-07T20:33:00.1014765Z scale_ub: Optional[float], 2025-05-07T20:33:00.1015657Z contiguous: bool, 2025-05-07T20:33:00.1016059Z compiled: bool, 2025-05-07T20:33:00.1016406Z ) -> None: 2025-05-07T20:33:00.1016756Z torch.manual_seed(2025) 2025-05-07T20:33:00.1017281Z 2025-05-07T20:33:00.1017739Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.1018359Z 2025-05-07T20:33:00.1018681Z x_sign = torch.sign(x) 2025-05-07T20:33:00.1019157Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.1019685Z x = x_sign * x_clamp 2025-05-07T20:33:00.1020086Z x0 = x[:, :D] 2025-05-07T20:33:00.1020435Z x1 = x[:, D:] 2025-05-07T20:33:00.1020777Z 2025-05-07T20:33:00.1021084Z if contiguous: 2025-05-07T20:33:00.1021462Z x0 = x0.contiguous() 2025-05-07T20:33:00.1021897Z x1 = x1.contiguous() 2025-05-07T20:33:00.1022292Z 2025-05-07T20:33:00.1022610Z if scale_ub is not None: 2025-05-07T20:33:00.1023089Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.1023654Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.1024181Z ) 2025-05-07T20:33:00.1024492Z else: 2025-05-07T20:33:00.1024847Z scale_ub_tensor = None 2025-05-07T20:33:00.1025271Z 2025-05-07T20:33:00.1025657Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.1026203Z op = silu_mul_quant 2025-05-07T20:33:00.1026629Z if compiled: 2025-05-07T20:33:00.1027041Z op = torch.compile(op) 2025-05-07T20:33:00.1027545Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.1028021Z 2025-05-07T20:33:00.1028331Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.1028622Z 2025-05-07T20:33:00.1028793Z moe/activation_test.py:117: 2025-05-07T20:33:00.1029296Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.1029873Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.1030354Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.1031571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.1032798Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.1033733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.1034943Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.1036217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.1036986Z kernel = self.compile( 2025-05-07T20:33:00.1037704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.1038604Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.1039168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.1039490Z 2025-05-07T20:33:00.1039763Z self = 2025-05-07T20:33:00.1041240Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.1043233Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b668dd800>} 2025-05-07T20:33:00.1045153Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.1046653Z context = 2025-05-07T20:33:00.1047251Z 2025-05-07T20:33:00.1047496Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.1048304Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.1049079Z module_map=module_map) 2025-05-07T20:33:00.1049638Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.1050146Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.1050561Z E ^ 2025-05-07T20:33:00.1051299Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.1052016Z 2025-05-07T20:33:00.1052674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.1053510Z 2025-05-07T20:33:00.1053672Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.1054368Z self=, 2025-05-07T20:33:00.1055073Z T=4096, 2025-05-07T20:33:00.1055376Z D=7168, 2025-05-07T20:33:00.1055699Z scale_ub=1200.0, 2025-05-07T20:33:00.1056067Z contiguous=False, 2025-05-07T20:33:00.1056456Z compiled=False, 2025-05-07T20:33:00.1056798Z ) 2025-05-07T20:33:00.1057335Z self = 2025-05-07T20:33:00.1058238Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.1058719Z 2025-05-07T20:33:00.1058851Z @given( 2025-05-07T20:33:00.1059233Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.1059766Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.1060282Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.1060842Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.1061406Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.1061896Z ) 2025-05-07T20:33:00.1062495Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.1063263Z def test_silu_mul_quant( 2025-05-07T20:33:00.1063669Z self, 2025-05-07T20:33:00.1063991Z T: int, 2025-05-07T20:33:00.1064319Z D: int, 2025-05-07T20:33:00.1064680Z scale_ub: Optional[float], 2025-05-07T20:33:00.1065130Z contiguous: bool, 2025-05-07T20:33:00.1065916Z compiled: bool, 2025-05-07T20:33:00.1066297Z ) -> None: 2025-05-07T20:33:00.1066653Z torch.manual_seed(2025) 2025-05-07T20:33:00.1067063Z 2025-05-07T20:33:00.1067515Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.1068090Z 2025-05-07T20:33:00.1068410Z x_sign = torch.sign(x) 2025-05-07T20:33:00.1068899Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.1069420Z x = x_sign * x_clamp 2025-05-07T20:33:00.1069819Z x0 = x[:, :D] 2025-05-07T20:33:00.1070182Z x1 = x[:, D:] 2025-05-07T20:33:00.1070540Z 2025-05-07T20:33:00.1070841Z if contiguous: 2025-05-07T20:33:00.1071230Z x0 = x0.contiguous() 2025-05-07T20:33:00.1071666Z x1 = x1.contiguous() 2025-05-07T20:33:00.1072073Z 2025-05-07T20:33:00.1072393Z if scale_ub is not None: 2025-05-07T20:33:00.1072856Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.1073413Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.1073938Z ) 2025-05-07T20:33:00.1074263Z else: 2025-05-07T20:33:00.1074608Z scale_ub_tensor = None 2025-05-07T20:33:00.1075037Z 2025-05-07T20:33:00.1075423Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.1076048Z op = silu_mul_quant 2025-05-07T20:33:00.1076473Z if compiled: 2025-05-07T20:33:00.1076888Z op = torch.compile(op) 2025-05-07T20:33:00.1077378Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.1078199Z 2025-05-07T20:33:00.1078544Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.1078827Z 2025-05-07T20:33:00.1079003Z moe/activation_test.py:117: 2025-05-07T20:33:00.1079493Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.1080155Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.1080628Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.1081830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:00.1083043Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.1083978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.1085175Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.1086330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.1087272Z kernel = self.compile( 2025-05-07T20:33:00.1088264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.1089408Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.1090083Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.1090490Z 2025-05-07T20:33:00.1090833Z self = 2025-05-07T20:33:00.1092742Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.1095195Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66e85300>} 2025-05-07T20:33:00.1097601Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.1099417Z context = 2025-05-07T20:33:00.1099923Z 2025-05-07T20:33:00.1100209Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.1101118Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.1101928Z module_map=module_map) 2025-05-07T20:33:00.1102548Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.1103172Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.1103612Z E ^ 2025-05-07T20:33:00.1104415Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.1105228Z 2025-05-07T20:33:00.1105968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.1106879Z 2025-05-07T20:33:00.1107069Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.1107775Z self=, 2025-05-07T20:33:00.1108526Z T=16384, 2025-05-07T20:33:00.1108850Z D=7168, 2025-05-07T20:33:00.1109165Z scale_ub=None, 2025-05-07T20:33:00.1109526Z contiguous=True, 2025-05-07T20:33:00.1109899Z compiled=True, 2025-05-07T20:33:00.1110234Z ) 2025-05-07T20:33:00.2884838Z self = 2025-05-07T20:33:00.2885782Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.2886265Z 2025-05-07T20:33:00.2886394Z @given( 2025-05-07T20:33:00.2886778Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.2887881Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.2888408Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.2901786Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.2902544Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.2903037Z ) 2025-05-07T20:33:00.2903629Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.2904400Z def test_silu_mul_quant( 2025-05-07T20:33:00.2904804Z self, 2025-05-07T20:33:00.2905134Z T: int, 2025-05-07T20:33:00.2905469Z D: int, 2025-05-07T20:33:00.2905828Z scale_ub: Optional[float], 2025-05-07T20:33:00.2906294Z contiguous: bool, 2025-05-07T20:33:00.2906704Z compiled: bool, 2025-05-07T20:33:00.2907084Z ) -> None: 2025-05-07T20:33:00.2907439Z torch.manual_seed(2025) 2025-05-07T20:33:00.2907855Z 2025-05-07T20:33:00.2908320Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.2908921Z 2025-05-07T20:33:00.2909244Z x_sign = torch.sign(x) 2025-05-07T20:33:00.2909738Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.2910269Z x = x_sign * x_clamp 2025-05-07T20:33:00.2910675Z x0 = x[:, :D] 2025-05-07T20:33:00.2911042Z x1 = x[:, D:] 2025-05-07T20:33:00.2911379Z 2025-05-07T20:33:00.2911688Z if contiguous: 2025-05-07T20:33:00.2912080Z x0 = x0.contiguous() 2025-05-07T20:33:00.2912508Z x1 = x1.contiguous() 2025-05-07T20:33:00.2912919Z 2025-05-07T20:33:00.2913239Z if scale_ub is not None: 2025-05-07T20:33:00.2913697Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.2914270Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.2914797Z ) 2025-05-07T20:33:00.2915112Z else: 2025-05-07T20:33:00.2915468Z scale_ub_tensor = None 2025-05-07T20:33:00.2916028Z 2025-05-07T20:33:00.2916426Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.2916883Z op = silu_mul_quant 2025-05-07T20:33:00.2917203Z if compiled: 2025-05-07T20:33:00.2917527Z op = torch.compile(op) 2025-05-07T20:33:00.2917969Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.2918362Z 2025-05-07T20:33:00.2918637Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.2918857Z 2025-05-07T20:33:00.2918993Z moe/activation_test.py:117: 2025-05-07T20:33:00.2919392Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2919860Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.2920243Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.2920998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.2921803Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.2922770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.2923727Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.2924521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.2925504Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.2926473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.2927237Z kernel = self.compile( 2025-05-07T20:33:00.2928094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.2929131Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.2929760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2930136Z 2025-05-07T20:33:00.2930601Z self = 2025-05-07T20:33:00.2932410Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.2934681Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b90a65440>} 2025-05-07T20:33:00.2936834Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.2938406Z context = 2025-05-07T20:33:00.2938864Z 2025-05-07T20:33:00.2939120Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.2939943Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.2940703Z module_map=module_map) 2025-05-07T20:33:00.2941290Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.2941832Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.2942251Z E ^ 2025-05-07T20:33:00.2943030Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.2943808Z 2025-05-07T20:33:00.2944536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.2945441Z 2025-05-07T20:33:00.2945613Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.2946309Z self=, 2025-05-07T20:33:00.2946937Z T=4096, 2025-05-07T20:33:00.2947223Z D=5120, 2025-05-07T20:33:00.2947542Z scale_ub=None, 2025-05-07T20:33:00.2947885Z contiguous=False, 2025-05-07T20:33:00.2948238Z compiled=True, 2025-05-07T20:33:00.2948553Z ) 2025-05-07T20:33:00.2949060Z self = 2025-05-07T20:33:00.2949849Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:00.2950285Z 2025-05-07T20:33:00.2950404Z @given( 2025-05-07T20:33:00.2950723Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.2951183Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.2951634Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.2952142Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.2952636Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.2953070Z ) 2025-05-07T20:33:00.2953609Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.2954274Z def test_silu_mul_quant( 2025-05-07T20:33:00.2954656Z self, 2025-05-07T20:33:00.2954948Z T: int, 2025-05-07T20:33:00.2955230Z D: int, 2025-05-07T20:33:00.2955553Z scale_ub: Optional[float], 2025-05-07T20:33:00.2956089Z contiguous: bool, 2025-05-07T20:33:00.2956453Z compiled: bool, 2025-05-07T20:33:00.2956804Z ) -> None: 2025-05-07T20:33:00.2957128Z torch.manual_seed(2025) 2025-05-07T20:33:00.2957496Z 2025-05-07T20:33:00.2957901Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.2958435Z 2025-05-07T20:33:00.2958738Z x_sign = torch.sign(x) 2025-05-07T20:33:00.2959177Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.2959667Z x = x_sign * x_clamp 2025-05-07T20:33:00.2960044Z x0 = x[:, :D] 2025-05-07T20:33:00.2960367Z x1 = x[:, D:] 2025-05-07T20:33:00.2960691Z 2025-05-07T20:33:00.2960981Z if contiguous: 2025-05-07T20:33:00.2961483Z x0 = x0.contiguous() 2025-05-07T20:33:00.2961939Z x1 = x1.contiguous() 2025-05-07T20:33:00.2962302Z 2025-05-07T20:33:00.2962587Z if scale_ub is not None: 2025-05-07T20:33:00.2963010Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.2963592Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.2964063Z ) 2025-05-07T20:33:00.2964357Z else: 2025-05-07T20:33:00.2964677Z scale_ub_tensor = None 2025-05-07T20:33:00.2965051Z 2025-05-07T20:33:00.2965782Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.2966243Z op = silu_mul_quant 2025-05-07T20:33:00.2966580Z if compiled: 2025-05-07T20:33:00.2966909Z op = torch.compile(op) 2025-05-07T20:33:00.2967289Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.2967650Z 2025-05-07T20:33:00.2967854Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.2968022Z 2025-05-07T20:33:00.2968144Z moe/activation_test.py:117: 2025-05-07T20:33:00.2968436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2968774Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.2969067Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.2969624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.2970192Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.2970848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.2971533Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.2972063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.2972746Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.2973421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.2973951Z kernel = self.compile( 2025-05-07T20:33:00.2974496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.2975157Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.2975565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2975796Z 2025-05-07T20:33:00.2976002Z self = 2025-05-07T20:33:00.2977088Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.2978523Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b90283060>} 2025-05-07T20:33:00.2979867Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.2980896Z context = 2025-05-07T20:33:00.2981190Z 2025-05-07T20:33:00.2981361Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.2981894Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.2982373Z module_map=module_map) 2025-05-07T20:33:00.2982741Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.2983103Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.2983369Z E ^ 2025-05-07T20:33:00.2984020Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.2984534Z 2025-05-07T20:33:00.2984948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.2985522Z 2025-05-07T20:33:00.4453885Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.4454331Z self=, 2025-05-07T20:33:00.4454774Z T=4096, 2025-05-07T20:33:00.4454969Z D=5120, 2025-05-07T20:33:00.4455162Z scale_ub=1200.0, 2025-05-07T20:33:00.4455384Z contiguous=False, 2025-05-07T20:33:00.4455613Z compiled=False, 2025-05-07T20:33:00.4455825Z ) 2025-05-07T20:33:00.4456142Z self = 2025-05-07T20:33:00.4456645Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.4456922Z 2025-05-07T20:33:00.4457009Z @given( 2025-05-07T20:33:00.4457260Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.4457585Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.4457899Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.4458240Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.4458569Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.4458857Z ) 2025-05-07T20:33:00.4459206Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.4459647Z def test_silu_mul_quant( 2025-05-07T20:33:00.4459891Z self, 2025-05-07T20:33:00.4460093Z T: int, 2025-05-07T20:33:00.4460287Z D: int, 2025-05-07T20:33:00.4460510Z scale_ub: Optional[float], 2025-05-07T20:33:00.4460787Z contiguous: bool, 2025-05-07T20:33:00.4461024Z compiled: bool, 2025-05-07T20:33:00.4461262Z ) -> None: 2025-05-07T20:33:00.4461491Z torch.manual_seed(2025) 2025-05-07T20:33:00.4461734Z 2025-05-07T20:33:00.4462018Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.4462368Z 2025-05-07T20:33:00.4462559Z x_sign = torch.sign(x) 2025-05-07T20:33:00.4462860Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.4463183Z x = x_sign * x_clamp 2025-05-07T20:33:00.4463432Z x0 = x[:, :D] 2025-05-07T20:33:00.4463646Z x1 = x[:, D:] 2025-05-07T20:33:00.4463857Z 2025-05-07T20:33:00.4464049Z if contiguous: 2025-05-07T20:33:00.4464278Z x0 = x0.contiguous() 2025-05-07T20:33:00.4464544Z x1 = x1.contiguous() 2025-05-07T20:33:00.4464788Z 2025-05-07T20:33:00.4464978Z if scale_ub is not None: 2025-05-07T20:33:00.4465263Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.4465844Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.4466152Z ) 2025-05-07T20:33:00.4466351Z else: 2025-05-07T20:33:00.4466571Z scale_ub_tensor = None 2025-05-07T20:33:00.4466821Z 2025-05-07T20:33:00.4467057Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.4467376Z op = silu_mul_quant 2025-05-07T20:33:00.4467625Z if compiled: 2025-05-07T20:33:00.4467875Z op = torch.compile(op) 2025-05-07T20:33:00.4468170Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.4468449Z 2025-05-07T20:33:00.4468640Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.4468811Z 2025-05-07T20:33:00.4468913Z moe/activation_test.py:117: 2025-05-07T20:33:00.4469210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.4469544Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.4469828Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.4470513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:00.4471553Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.4472163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.4472846Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.4473595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.4474122Z kernel = self.compile( 2025-05-07T20:33:00.4474662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.4475316Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.4475723Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.4476043Z 2025-05-07T20:33:00.4476247Z self = 2025-05-07T20:33:00.4477331Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.4478723Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b661b1b20>} 2025-05-07T20:33:00.4480061Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.4481076Z context = 2025-05-07T20:33:00.4481374Z 2025-05-07T20:33:00.4481542Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.4482069Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.4482550Z module_map=module_map) 2025-05-07T20:33:00.4482911Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.4483277Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.4483543Z E ^ 2025-05-07T20:33:00.4484003Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.4484463Z 2025-05-07T20:33:00.4484880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.4485395Z 2025-05-07T20:33:00.4485499Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.4485922Z self=, 2025-05-07T20:33:00.4486325Z T=4096, 2025-05-07T20:33:00.4486520Z D=5120, 2025-05-07T20:33:00.4486718Z scale_ub=1200.0, 2025-05-07T20:33:00.4486953Z contiguous=False, 2025-05-07T20:33:00.4487185Z compiled=True, 2025-05-07T20:33:00.4487402Z ) 2025-05-07T20:33:00.4487714Z self = 2025-05-07T20:33:00.4488211Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:00.4488493Z 2025-05-07T20:33:00.4488575Z @given( 2025-05-07T20:33:00.4488811Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.4489129Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.4489443Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.4489773Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.4490098Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.4490385Z ) 2025-05-07T20:33:00.4490732Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.4491167Z def test_silu_mul_quant( 2025-05-07T20:33:00.4491412Z self, 2025-05-07T20:33:00.4491608Z T: int, 2025-05-07T20:33:00.4491933Z D: int, 2025-05-07T20:33:00.4492155Z scale_ub: Optional[float], 2025-05-07T20:33:00.4492430Z contiguous: bool, 2025-05-07T20:33:00.4492675Z compiled: bool, 2025-05-07T20:33:00.4492934Z ) -> None: 2025-05-07T20:33:00.4493154Z torch.manual_seed(2025) 2025-05-07T20:33:00.4493397Z 2025-05-07T20:33:00.4493663Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.4494009Z 2025-05-07T20:33:00.4494203Z x_sign = torch.sign(x) 2025-05-07T20:33:00.4494488Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.4494799Z x = x_sign * x_clamp 2025-05-07T20:33:00.4495044Z x0 = x[:, :D] 2025-05-07T20:33:00.4495258Z x1 = x[:, D:] 2025-05-07T20:33:00.4495470Z 2025-05-07T20:33:00.4495659Z if contiguous: 2025-05-07T20:33:00.4495886Z x0 = x0.contiguous() 2025-05-07T20:33:00.4496146Z x1 = x1.contiguous() 2025-05-07T20:33:00.4496394Z 2025-05-07T20:33:00.4496593Z if scale_ub is not None: 2025-05-07T20:33:00.4496870Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.4497209Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.4497524Z ) 2025-05-07T20:33:00.4497717Z else: 2025-05-07T20:33:00.4497932Z scale_ub_tensor = None 2025-05-07T20:33:00.4498188Z 2025-05-07T20:33:00.4498421Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.4498740Z op = silu_mul_quant 2025-05-07T20:33:00.4498993Z if compiled: 2025-05-07T20:33:00.4499236Z op = torch.compile(op) 2025-05-07T20:33:00.4499534Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.4499812Z 2025-05-07T20:33:00.4500004Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.4500178Z 2025-05-07T20:33:00.4500277Z moe/activation_test.py:117: 2025-05-07T20:33:00.4500577Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.4500923Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.4501202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.4501761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.4502323Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.4502981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.4503659Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.4504208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.4504885Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.4505544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.4506067Z kernel = self.compile( 2025-05-07T20:33:00.4506610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.4507265Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.4507660Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.4507895Z 2025-05-07T20:33:00.4508100Z self = 2025-05-07T20:33:00.4509177Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.4510551Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b661b3f60>} 2025-05-07T20:33:00.4511971Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.4513029Z context = 2025-05-07T20:33:00.4513362Z 2025-05-07T20:33:00.4513529Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.4514058Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.4514531Z module_map=module_map) 2025-05-07T20:33:00.4514890Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.4515248Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.4515513Z E ^ 2025-05-07T20:33:00.4516081Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.4516540Z 2025-05-07T20:33:00.4516956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.4517471Z 2025-05-07T20:33:00.5678300Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.5678742Z self=, 2025-05-07T20:33:00.5679147Z T=2048, 2025-05-07T20:33:00.5679341Z D=7168, 2025-05-07T20:33:00.5679538Z scale_ub=1200.0, 2025-05-07T20:33:00.5679763Z contiguous=False, 2025-05-07T20:33:00.5679992Z compiled=False, 2025-05-07T20:33:00.5680202Z ) 2025-05-07T20:33:00.5680520Z self = 2025-05-07T20:33:00.5681020Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.5681307Z 2025-05-07T20:33:00.5681387Z @given( 2025-05-07T20:33:00.5681627Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5681942Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5682273Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5682610Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5682938Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5683228Z ) 2025-05-07T20:33:00.5683578Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5684023Z def test_silu_mul_quant( 2025-05-07T20:33:00.5684264Z self, 2025-05-07T20:33:00.5684469Z T: int, 2025-05-07T20:33:00.5684672Z D: int, 2025-05-07T20:33:00.5684891Z scale_ub: Optional[float], 2025-05-07T20:33:00.5685187Z contiguous: bool, 2025-05-07T20:33:00.5685435Z compiled: bool, 2025-05-07T20:33:00.5685662Z ) -> None: 2025-05-07T20:33:00.5685882Z torch.manual_seed(2025) 2025-05-07T20:33:00.5686127Z 2025-05-07T20:33:00.5686402Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5686758Z 2025-05-07T20:33:00.5686961Z x_sign = torch.sign(x) 2025-05-07T20:33:00.5687276Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.5687594Z x = x_sign * x_clamp 2025-05-07T20:33:00.5687837Z x0 = x[:, :D] 2025-05-07T20:33:00.5688061Z x1 = x[:, D:] 2025-05-07T20:33:00.5688276Z 2025-05-07T20:33:00.5688467Z if contiguous: 2025-05-07T20:33:00.5688703Z x0 = x0.contiguous() 2025-05-07T20:33:00.5688977Z x1 = x1.contiguous() 2025-05-07T20:33:00.5689228Z 2025-05-07T20:33:00.5689428Z if scale_ub is not None: 2025-05-07T20:33:00.5689715Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.5690048Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.5690372Z ) 2025-05-07T20:33:00.5690576Z else: 2025-05-07T20:33:00.5690798Z scale_ub_tensor = None 2025-05-07T20:33:00.5691048Z 2025-05-07T20:33:00.5691287Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.5691891Z op = silu_mul_quant 2025-05-07T20:33:00.5692142Z if compiled: 2025-05-07T20:33:00.5692397Z op = torch.compile(op) 2025-05-07T20:33:00.5701955Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.5702277Z 2025-05-07T20:33:00.5702487Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.5702654Z 2025-05-07T20:33:00.5702769Z moe/activation_test.py:117: 2025-05-07T20:33:00.5703071Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.5703405Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.5703694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.5704396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:00.5705083Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.5705633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.5706321Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.5706987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.5707518Z kernel = self.compile( 2025-05-07T20:33:00.5708064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.5708722Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.5709121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.5709359Z 2025-05-07T20:33:00.5709568Z self = 2025-05-07T20:33:00.5710656Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.5712052Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b661b1440>} 2025-05-07T20:33:00.5713404Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.5714423Z context = 2025-05-07T20:33:00.5714720Z 2025-05-07T20:33:00.5714888Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.5715421Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.5715988Z module_map=module_map) 2025-05-07T20:33:00.5716360Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.5716723Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.5716987Z E ^ 2025-05-07T20:33:00.5717446Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.5717908Z 2025-05-07T20:33:00.5718319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.5718840Z 2025-05-07T20:33:00.5718949Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.5719366Z self=, 2025-05-07T20:33:00.5719769Z T=1, 2025-05-07T20:33:00.5719965Z D=7168, 2025-05-07T20:33:00.5720166Z scale_ub=None, 2025-05-07T20:33:00.5720378Z contiguous=True, 2025-05-07T20:33:00.5720607Z compiled=False, 2025-05-07T20:33:00.5720818Z ) 2025-05-07T20:33:00.5721135Z self = 2025-05-07T20:33:00.5721793Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.5722054Z 2025-05-07T20:33:00.5722143Z @given( 2025-05-07T20:33:00.5722380Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5722735Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5723046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5723378Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5723702Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5723995Z ) 2025-05-07T20:33:00.5724352Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5724789Z def test_silu_mul_quant( 2025-05-07T20:33:00.5725039Z self, 2025-05-07T20:33:00.5725241Z T: int, 2025-05-07T20:33:00.5725437Z D: int, 2025-05-07T20:33:00.5725668Z scale_ub: Optional[float], 2025-05-07T20:33:00.5725949Z contiguous: bool, 2025-05-07T20:33:00.5726197Z compiled: bool, 2025-05-07T20:33:00.5726438Z ) -> None: 2025-05-07T20:33:00.5726667Z torch.manual_seed(2025) 2025-05-07T20:33:00.5726910Z 2025-05-07T20:33:00.5727199Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5727550Z 2025-05-07T20:33:00.5727757Z x_sign = torch.sign(x) 2025-05-07T20:33:00.5728050Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.5728374Z x = x_sign * x_clamp 2025-05-07T20:33:00.5728624Z x0 = x[:, :D] 2025-05-07T20:33:00.5728840Z x1 = x[:, D:] 2025-05-07T20:33:00.5729058Z 2025-05-07T20:33:00.5729254Z if contiguous: 2025-05-07T20:33:00.5729490Z x0 = x0.contiguous() 2025-05-07T20:33:00.5729757Z x1 = x1.contiguous() 2025-05-07T20:33:00.5730006Z 2025-05-07T20:33:00.5730201Z if scale_ub is not None: 2025-05-07T20:33:00.5730491Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.5730838Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.5731146Z ) 2025-05-07T20:33:00.5731350Z else: 2025-05-07T20:33:00.5731575Z scale_ub_tensor = None 2025-05-07T20:33:00.5731831Z 2025-05-07T20:33:00.5732068Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.5732398Z op = silu_mul_quant 2025-05-07T20:33:00.5732659Z if compiled: 2025-05-07T20:33:00.5732908Z op = torch.compile(op) 2025-05-07T20:33:00.5733218Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.5733501Z 2025-05-07T20:33:00.5733697Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.5733869Z 2025-05-07T20:33:00.5733971Z moe/activation_test.py:117: 2025-05-07T20:33:00.5734279Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.5734613Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.5734901Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.5735611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.5736307Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.5736844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.5737535Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.5738204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.5738734Z kernel = self.compile( 2025-05-07T20:33:00.5739279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.5739937Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.5740428Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.5740700Z 2025-05-07T20:33:00.5740906Z self = 2025-05-07T20:33:00.5741996Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.5743411Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66837740>} 2025-05-07T20:33:00.5744764Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.5745800Z context = 2025-05-07T20:33:00.5746093Z 2025-05-07T20:33:00.5746274Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.5746808Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.5747291Z module_map=module_map) 2025-05-07T20:33:00.5747663Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.5748058Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.5748355Z E ^ 2025-05-07T20:33:00.5748824Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
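Every failure in this stretch of the log is the same failure. The runner is a g5.4xlarge, whose A10G GPU reports compute capability sm_86, and Triton's fp8e4nv dtype (its name for the float8_e4m3fn format that silu_mul_quant quantizes to) is generally only available on sm_89-class (Ada) or newer GPUs; on this card only fp8e4b15 and fp8e5 exist, exactly as the ValueError reports. Stripped of the Hypothesis harness, the failing call reduces to the sketch below; the module path, call signature, and shapes are taken from the log itself, the rest is illustrative:

    # Minimal repro distilled from the failing test. Assumes a CUDA build of
    # fbgemm_gpu with the experimental gen_ai ops installed.
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 1, 7168
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

    # On sm_89+ GPUs this returns an fp8 tensor and its scales; on an sm_86
    # A10G, Triton raises CompilationError while building _fbgemm_silu_mul_quant.
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)  # the scale_ub tensor is optional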
Hypothesis went on to try eleven more parameter combinations, and every one failed at the same point with the same CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The test source and traceback were identical to the first example above; runs with compiled=True additionally passed through torch/_dynamo/eval_frame.py:678 (return fn(*args, **kwargs)) before reaching activation.py:80. The examples tried:

    T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True
    T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False
    T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True
    T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True
    T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False
    T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
    T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True
    T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True
    T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
    T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
    T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.8344405Z 2025-05-07T20:33:01.8344826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.8345335Z 2025-05-07T20:33:01.8345453Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.8345863Z self=, 2025-05-07T20:33:01.8346271Z T=16384, 2025-05-07T20:33:01.8346475Z D=7168, 2025-05-07T20:33:01.8346678Z scale_ub=1200.0, 2025-05-07T20:33:01.8346901Z contiguous=True, 2025-05-07T20:33:01.8347133Z compiled=True, 2025-05-07T20:33:01.8347353Z ) 2025-05-07T20:33:01.8347669Z self = 2025-05-07T20:33:01.8348174Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.8348455Z 2025-05-07T20:33:01.8348543Z @given( 2025-05-07T20:33:01.8348786Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.8349104Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.8349413Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.8349745Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.8350083Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.8350375Z ) 2025-05-07T20:33:01.8350730Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.8351167Z def test_silu_mul_quant( 2025-05-07T20:33:01.8351416Z self, 2025-05-07T20:33:01.8351617Z T: int, 2025-05-07T20:33:01.8351815Z D: int, 2025-05-07T20:33:01.8352040Z scale_ub: Optional[float], 2025-05-07T20:33:01.8352317Z contiguous: bool, 2025-05-07T20:33:01.8352554Z compiled: bool, 2025-05-07T20:33:01.8352788Z ) -> None: 2025-05-07T20:33:01.8353009Z torch.manual_seed(2025) 2025-05-07T20:33:01.8353253Z 2025-05-07T20:33:01.8353530Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.8353885Z 2025-05-07T20:33:01.8354084Z x_sign = torch.sign(x) 2025-05-07T20:33:01.8354381Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.8354704Z x = x_sign * x_clamp 2025-05-07T20:33:01.8354953Z x0 = x[:, :D] 2025-05-07T20:33:01.8355173Z x1 = x[:, D:] 2025-05-07T20:33:01.8355392Z 2025-05-07T20:33:01.8355590Z if contiguous: 2025-05-07T20:33:01.8355883Z x0 = x0.contiguous() 2025-05-07T20:33:01.8356155Z x1 = x1.contiguous() 2025-05-07T20:33:01.8356406Z 2025-05-07T20:33:01.8356602Z if scale_ub is not None: 2025-05-07T20:33:01.8356881Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.8357223Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.8357533Z ) 2025-05-07T20:33:01.8357731Z else: 2025-05-07T20:33:01.8358479Z scale_ub_tensor = None 2025-05-07T20:33:01.8358756Z 2025-05-07T20:33:01.8359029Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.8359356Z op = silu_mul_quant 2025-05-07T20:33:01.8359647Z if compiled: 2025-05-07T20:33:01.8359907Z op = torch.compile(op) 2025-05-07T20:33:01.8360213Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.8360492Z 2025-05-07T20:33:01.8360700Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.8360875Z 2025-05-07T20:33:01.8360979Z moe/activation_test.py:117: 2025-05-07T20:33:01.8361276Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.8361604Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.8361885Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.8362445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.8363012Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.8363669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.8364362Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.8364900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.8365857Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.8366520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.8367058Z kernel = self.compile( 2025-05-07T20:33:01.8367590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.8368244Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.8368683Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.8368939Z 2025-05-07T20:33:01.8369151Z self = 2025-05-07T20:33:01.8370221Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.8371621Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a6948e0c0>} 2025-05-07T20:33:01.8372974Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.8374002Z context = 2025-05-07T20:33:01.8374297Z 2025-05-07T20:33:01.8374484Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.8375016Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.8375504Z module_map=module_map) 2025-05-07T20:33:01.8375884Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.8376248Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.8376524Z E ^ 2025-05-07T20:33:01.8376998Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.8377451Z 2025-05-07T20:33:01.8377881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.8378398Z 2025-05-07T20:33:01.9622833Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.9623485Z self=, 2025-05-07T20:33:01.9624537Z T=16384, 2025-05-07T20:33:01.9624806Z D=5120, 2025-05-07T20:33:01.9625054Z scale_ub=1200.0, 2025-05-07T20:33:01.9625351Z contiguous=True, 2025-05-07T20:33:01.9625644Z compiled=False, 2025-05-07T20:33:01.9625938Z ) 2025-05-07T20:33:01.9635464Z self = 2025-05-07T20:33:01.9636034Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.9636324Z 2025-05-07T20:33:01.9636406Z @given( 2025-05-07T20:33:01.9636648Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.9636962Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.9637272Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.9637610Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.9637943Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.9638228Z ) 2025-05-07T20:33:01.9638595Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.9639045Z def test_silu_mul_quant( 2025-05-07T20:33:01.9639322Z self, 2025-05-07T20:33:01.9639527Z T: int, 2025-05-07T20:33:01.9639726Z D: int, 2025-05-07T20:33:01.9639953Z scale_ub: Optional[float], 2025-05-07T20:33:01.9640233Z contiguous: bool, 2025-05-07T20:33:01.9640470Z compiled: bool, 2025-05-07T20:33:01.9640706Z ) -> None: 2025-05-07T20:33:01.9640932Z torch.manual_seed(2025) 2025-05-07T20:33:01.9641173Z 2025-05-07T20:33:01.9641451Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.9641808Z 2025-05-07T20:33:01.9642004Z x_sign = torch.sign(x) 2025-05-07T20:33:01.9642306Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.9642620Z x = x_sign * x_clamp 2025-05-07T20:33:01.9642864Z x0 = x[:, :D] 2025-05-07T20:33:01.9643089Z x1 = x[:, D:] 2025-05-07T20:33:01.9643310Z 2025-05-07T20:33:01.9643494Z if contiguous: 2025-05-07T20:33:01.9643732Z x0 = x0.contiguous() 2025-05-07T20:33:01.9643998Z x1 = x1.contiguous() 2025-05-07T20:33:01.9644235Z 2025-05-07T20:33:01.9644433Z if scale_ub is not None: 2025-05-07T20:33:01.9644710Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.9645049Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.9645355Z ) 2025-05-07T20:33:01.9645551Z else: 2025-05-07T20:33:01.9645765Z scale_ub_tensor = None 2025-05-07T20:33:01.9646013Z 2025-05-07T20:33:01.9646251Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.9646574Z op = silu_mul_quant 2025-05-07T20:33:01.9646822Z if compiled: 2025-05-07T20:33:01.9647073Z op = torch.compile(op) 2025-05-07T20:33:01.9647370Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.9647640Z 2025-05-07T20:33:01.9647846Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.9648008Z 2025-05-07T20:33:01.9648115Z moe/activation_test.py:117: 2025-05-07T20:33:01.9648407Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.9648750Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.9649040Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.9649739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:01.9650424Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.9650964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.9651655Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.9652330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.9653019Z kernel = self.compile( 2025-05-07T20:33:01.9653566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.9654223Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.9654664Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.9654906Z 2025-05-07T20:33:01.9655114Z self = 2025-05-07T20:33:01.9656202Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.9657596Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a6948f1a0>} 2025-05-07T20:33:01.9658955Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.9659982Z context = 2025-05-07T20:33:01.9660281Z 2025-05-07T20:33:01.9660450Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.9660984Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.9661462Z module_map=module_map) 2025-05-07T20:33:01.9661827Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.9662197Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.9662468Z E ^ 2025-05-07T20:33:01.9662933Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.9663398Z 2025-05-07T20:33:01.9663818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.9664342Z 2025-05-07T20:33:01.9664446Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.9664878Z self=, 2025-05-07T20:33:01.9665279Z T=1, 2025-05-07T20:33:01.9665732Z D=7168, 2025-05-07T20:33:01.9665934Z scale_ub=1200.0, 2025-05-07T20:33:01.9666159Z contiguous=False, 2025-05-07T20:33:01.9666397Z compiled=False, 2025-05-07T20:33:01.9666611Z ) 2025-05-07T20:33:01.9666933Z self = 2025-05-07T20:33:01.9667439Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:01.9667712Z 2025-05-07T20:33:01.9667800Z @given( 2025-05-07T20:33:01.9668030Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.9668357Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.9668680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.9669018Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.9669346Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.9669645Z ) 2025-05-07T20:33:01.9669995Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.9670435Z def test_silu_mul_quant( 2025-05-07T20:33:01.9670682Z self, 2025-05-07T20:33:01.9670888Z T: int, 2025-05-07T20:33:01.9671087Z D: int, 2025-05-07T20:33:01.9671320Z scale_ub: Optional[float], 2025-05-07T20:33:01.9671603Z contiguous: bool, 2025-05-07T20:33:01.9671854Z compiled: bool, 2025-05-07T20:33:01.9672091Z ) -> None: 2025-05-07T20:33:01.9672319Z torch.manual_seed(2025) 2025-05-07T20:33:01.9672569Z 2025-05-07T20:33:01.9672849Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.9673415Z 2025-05-07T20:33:01.9673611Z x_sign = torch.sign(x) 2025-05-07T20:33:01.9673906Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.9674225Z x = x_sign * x_clamp 2025-05-07T20:33:01.9674537Z x0 = x[:, :D] 2025-05-07T20:33:01.9674755Z x1 = x[:, D:] 2025-05-07T20:33:01.9674968Z 2025-05-07T20:33:01.9675153Z if contiguous: 2025-05-07T20:33:01.9675378Z x0 = x0.contiguous() 2025-05-07T20:33:01.9675642Z x1 = x1.contiguous() 2025-05-07T20:33:01.9675964Z 2025-05-07T20:33:01.9676158Z if scale_ub is not None: 2025-05-07T20:33:01.9676440Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.9676781Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.9677090Z ) 2025-05-07T20:33:01.9677291Z else: 2025-05-07T20:33:01.9677508Z scale_ub_tensor = None 2025-05-07T20:33:01.9677764Z 2025-05-07T20:33:01.9678007Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.9678321Z op = silu_mul_quant 2025-05-07T20:33:01.9678575Z if compiled: 2025-05-07T20:33:01.9678824Z op = torch.compile(op) 2025-05-07T20:33:01.9679117Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.9679393Z 2025-05-07T20:33:01.9679592Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.9679754Z 2025-05-07T20:33:01.9679860Z moe/activation_test.py:117: 2025-05-07T20:33:01.9680148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.9680480Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.9680763Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.9681444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.9682132Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.9682674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.9683354Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.9684018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.9684556Z kernel = self.compile( 2025-05-07T20:33:01.9685094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.9685739Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.9686138Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.9686374Z 2025-05-07T20:33:01.9686581Z self = 2025-05-07T20:33:01.9687666Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.9689040Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66cf0680>} 2025-05-07T20:33:01.9690377Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.9691403Z context = 2025-05-07T20:33:01.9691695Z 2025-05-07T20:33:01.9691875Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.9692416Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.9692884Z module_map=module_map) 2025-05-07T20:33:01.9693387Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.9693755Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.9694022Z E ^ 2025-05-07T20:33:01.9694499Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.9694991Z 2025-05-07T20:33:01.9695411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.9695922Z 2025-05-07T20:33:02.1463547Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.1464783Z self=, 2025-05-07T20:33:02.1466247Z T=4096, 2025-05-07T20:33:02.1466775Z D=7168, 2025-05-07T20:33:02.1467222Z scale_ub=1200.0, 2025-05-07T20:33:02.1467683Z contiguous=False, 2025-05-07T20:33:02.1468130Z compiled=True, 2025-05-07T20:33:02.1468537Z ) 2025-05-07T20:33:02.1468998Z self = 2025-05-07T20:33:02.1469514Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:02.1469792Z 2025-05-07T20:33:02.1469878Z @given( 2025-05-07T20:33:02.1470112Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.1470454Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.1470770Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.1471106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.1471437Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.1471733Z ) 2025-05-07T20:33:02.1472088Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.1472532Z def test_silu_mul_quant( 2025-05-07T20:33:02.1472787Z self, 2025-05-07T20:33:02.1472992Z T: int, 2025-05-07T20:33:02.1473189Z D: int, 2025-05-07T20:33:02.1473421Z scale_ub: Optional[float], 2025-05-07T20:33:02.1473707Z contiguous: bool, 2025-05-07T20:33:02.1473947Z compiled: bool, 2025-05-07T20:33:02.1474183Z ) -> None: 2025-05-07T20:33:02.1474409Z torch.manual_seed(2025) 2025-05-07T20:33:02.1474657Z 2025-05-07T20:33:02.1474935Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.1475291Z 2025-05-07T20:33:02.1475491Z x_sign = torch.sign(x) 2025-05-07T20:33:02.1475869Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.1476193Z x = x_sign * x_clamp 2025-05-07T20:33:02.1476439Z x0 = x[:, :D] 2025-05-07T20:33:02.1476656Z x1 = x[:, D:] 2025-05-07T20:33:02.1476870Z 2025-05-07T20:33:02.1477063Z if contiguous: 2025-05-07T20:33:02.1477295Z x0 = x0.contiguous() 2025-05-07T20:33:02.1477558Z x1 = x1.contiguous() 2025-05-07T20:33:02.1477804Z 2025-05-07T20:33:02.1477999Z if scale_ub is not None: 2025-05-07T20:33:02.1478284Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.1478628Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.1478944Z ) 2025-05-07T20:33:02.1479143Z else: 2025-05-07T20:33:02.1479364Z scale_ub_tensor = None 2025-05-07T20:33:02.1479616Z 2025-05-07T20:33:02.1479853Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.1480175Z op = silu_mul_quant 2025-05-07T20:33:02.1480431Z if compiled: 2025-05-07T20:33:02.1480680Z op = torch.compile(op) 2025-05-07T20:33:02.1480982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.1481264Z 2025-05-07T20:33:02.1481458Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.1481631Z 2025-05-07T20:33:02.1481734Z moe/activation_test.py:117: 2025-05-07T20:33:02.1482036Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.1482370Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.1483125Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.1483699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:02.1484344Z return fn(*args, **kwargs) 
2025-05-07T20:33:02.1485001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.1485695Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.1486243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.1486926Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.1487600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.1488148Z kernel = self.compile( 2025-05-07T20:33:02.1488707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.1489365Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.1489774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.1490010Z 2025-05-07T20:33:02.1490229Z self = 2025-05-07T20:33:02.1491317Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.1492711Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66cf1940>} 2025-05-07T20:33:02.1494068Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.1495101Z context = 2025-05-07T20:33:02.1495391Z 2025-05-07T20:33:02.1495569Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.1496091Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.1496568Z module_map=module_map) 2025-05-07T20:33:02.1496943Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.1497313Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.1497573Z E ^ 2025-05-07T20:33:02.1498044Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.1498497Z 2025-05-07T20:33:02.1498973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.1499491Z 2025-05-07T20:33:02.1499598Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.1500021Z self=, 2025-05-07T20:33:02.1500433Z T=128, 2025-05-07T20:33:02.1500627Z D=7168, 2025-05-07T20:33:02.1500821Z scale_ub=1200.0, 2025-05-07T20:33:02.1501050Z contiguous=False, 2025-05-07T20:33:02.1501286Z compiled=True, 2025-05-07T20:33:02.1501492Z ) 2025-05-07T20:33:02.2429792Z self = 2025-05-07T20:33:02.2430541Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:02.2430901Z 2025-05-07T20:33:02.2431020Z @given( 2025-05-07T20:33:02.2431335Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.2431771Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.2432143Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.2432876Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.2433212Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.2433510Z ) 2025-05-07T20:33:02.2433860Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.2434388Z def test_silu_mul_quant( 2025-05-07T20:33:02.2434638Z self, 2025-05-07T20:33:02.2434843Z T: int, 2025-05-07T20:33:02.2435041Z D: int, 2025-05-07T20:33:02.2435269Z scale_ub: Optional[float], 2025-05-07T20:33:02.2435550Z contiguous: bool, 2025-05-07T20:33:02.2435884Z compiled: bool, 2025-05-07T20:33:02.2436122Z ) -> None: 2025-05-07T20:33:02.2436345Z torch.manual_seed(2025) 2025-05-07T20:33:02.2436592Z 2025-05-07T20:33:02.2436871Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.2437227Z 2025-05-07T20:33:02.2437425Z x_sign = torch.sign(x) 2025-05-07T20:33:02.2437729Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.2438059Z x = x_sign * x_clamp 2025-05-07T20:33:02.2438302Z x0 = x[:, :D] 2025-05-07T20:33:02.2438534Z x1 = x[:, D:] 2025-05-07T20:33:02.2438757Z 2025-05-07T20:33:02.2438946Z if contiguous: 2025-05-07T20:33:02.2439187Z x0 = x0.contiguous() 2025-05-07T20:33:02.2439457Z x1 = x1.contiguous() 2025-05-07T20:33:02.2439705Z 2025-05-07T20:33:02.2439905Z if scale_ub is not None: 2025-05-07T20:33:02.2440183Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.2440530Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.2440842Z ) 2025-05-07T20:33:02.2441049Z else: 2025-05-07T20:33:02.2441274Z scale_ub_tensor = None 2025-05-07T20:33:02.2441534Z 2025-05-07T20:33:02.2441779Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.2442106Z op = silu_mul_quant 2025-05-07T20:33:02.2442397Z if compiled: 2025-05-07T20:33:02.2442650Z op = torch.compile(op) 2025-05-07T20:33:02.2442944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.2443228Z 2025-05-07T20:33:02.2443434Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.2443603Z 2025-05-07T20:33:02.2443705Z moe/activation_test.py:117: 2025-05-07T20:33:02.2444008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.2444353Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.2444638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.2445204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:02.2445768Z return fn(*args, **kwargs) 
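Note that compiled=True only adds the torch/_dynamo/eval_frame.py frame seen above; the exception is raised when the Triton kernel is first compiled at launch time, so the eager path fails identically. A minimal repro sketch, assuming the import path that pytest printed in the traceback:

    # Assumed repro; raises triton.compiler.errors.CompilationError on a
    # pre-SM89 GPU, with or without torch.compile.
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)  # scale_ub tensor omitted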
2025-05-07T20:33:02.2446431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.2447123Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.2447672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.2448362Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.2449025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.2449564Z kernel = self.compile( 2025-05-07T20:33:02.2450112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.2450775Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.2451176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.2451419Z 2025-05-07T20:33:02.2451630Z self = 2025-05-07T20:33:02.2452808Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.2454278Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66cf2700>} 2025-05-07T20:33:02.2455622Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.2456659Z context = 2025-05-07T20:33:02.2456955Z 2025-05-07T20:33:02.2457123Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.2457654Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.2458130Z module_map=module_map) 2025-05-07T20:33:02.2458518Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.2458881Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.2459157Z E ^ 2025-05-07T20:33:02.2459624Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.2460086Z 2025-05-07T20:33:02.2460505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.2461018Z 2025-05-07T20:33:02.2461132Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.2461553Z self=, 2025-05-07T20:33:02.2461957Z T=2048, 2025-05-07T20:33:02.2462157Z D=7168, 2025-05-07T20:33:02.2462363Z scale_ub=None, 2025-05-07T20:33:02.2462584Z contiguous=True, 2025-05-07T20:33:02.2463005Z compiled=True, 2025-05-07T20:33:02.2463228Z ) 2025-05-07T20:33:02.2463557Z self = 2025-05-07T20:33:02.2464063Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:02.2464339Z 2025-05-07T20:33:02.2464425Z @given( 2025-05-07T20:33:02.2464659Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.2464986Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.2465303Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.2465796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.2466124Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.2466415Z ) 2025-05-07T20:33:02.2466769Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.2467209Z def test_silu_mul_quant( 2025-05-07T20:33:02.2467461Z self, 2025-05-07T20:33:02.2467663Z T: int, 2025-05-07T20:33:02.2467858Z D: int, 2025-05-07T20:33:02.2468100Z scale_ub: Optional[float], 2025-05-07T20:33:02.2468381Z contiguous: bool, 2025-05-07T20:33:02.2468619Z compiled: bool, 2025-05-07T20:33:02.2468851Z ) -> None: 2025-05-07T20:33:02.2469078Z torch.manual_seed(2025) 2025-05-07T20:33:02.2469319Z 2025-05-07T20:33:02.2469597Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.2469949Z 2025-05-07T20:33:02.2470145Z x_sign = torch.sign(x) 2025-05-07T20:33:02.2470459Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.2470777Z x = x_sign * x_clamp 2025-05-07T20:33:02.2471021Z x0 = x[:, :D] 2025-05-07T20:33:02.2471248Z x1 = x[:, D:] 2025-05-07T20:33:02.2471468Z 2025-05-07T20:33:02.2471655Z if contiguous: 2025-05-07T20:33:02.2471903Z x0 = x0.contiguous() 2025-05-07T20:33:02.2472174Z x1 = x1.contiguous() 2025-05-07T20:33:02.2472422Z 2025-05-07T20:33:02.2472778Z if scale_ub is not None: 2025-05-07T20:33:02.2473112Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.2473460Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.2473836Z ) 2025-05-07T20:33:02.2474040Z else: 2025-05-07T20:33:02.2474262Z scale_ub_tensor = None 2025-05-07T20:33:02.2474520Z 2025-05-07T20:33:02.2474765Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.2475098Z op = silu_mul_quant 2025-05-07T20:33:02.2475353Z if compiled: 2025-05-07T20:33:02.2475609Z op = torch.compile(op) 2025-05-07T20:33:02.2484678Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.2484981Z 2025-05-07T20:33:02.2485186Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.2485363Z 2025-05-07T20:33:02.2485468Z moe/activation_test.py:117: 2025-05-07T20:33:02.2485775Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.2486134Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.2486422Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.2487012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:02.2487590Z return fn(*args, **kwargs) 
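For readers following along, the test's names (y_fp8, y_scale, scale_ub) suggest that silu_mul_quant computes silu(x0) * x1 and quantizes the result rowwise to fp8 with a per-row scale, optionally clamped by scale_ub. A pure-PyTorch sketch of that contract, offered as an illustration of the assumed semantics and not as FBGEMM's actual kernel:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,                          # [T, D], bfloat16
        x1: torch.Tensor,                          # [T, D], bfloat16
        scale_ub: Optional[torch.Tensor] = None,   # [1], float32
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / fp8_max                  # per-row dequantization scale
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(1)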
2025-05-07T20:33:02.2488260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.2488969Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.2489524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.2490231Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.2490900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.2491442Z kernel = self.compile( 2025-05-07T20:33:02.2492000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.2492669Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.2493076Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.2493318Z 2025-05-07T20:33:02.2493529Z self = 2025-05-07T20:33:02.2494630Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.2496024Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66cf37e0>} 2025-05-07T20:33:02.2497379Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.2498425Z context = 2025-05-07T20:33:02.2498734Z 2025-05-07T20:33:02.2498936Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.2499506Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.2499981Z module_map=module_map) 2025-05-07T20:33:02.2500357Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.2500723Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.2500986Z E ^ 2025-05-07T20:33:02.2501459Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.2501921Z 2025-05-07T20:33:02.2502450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.2503001Z 2025-05-07T20:33:02.3144929Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.3146120Z self=, 2025-05-07T20:33:02.3147258Z T=16384, 2025-05-07T20:33:02.3147661Z D=5120, 2025-05-07T20:33:02.3148068Z scale_ub=None, 2025-05-07T20:33:02.3148517Z contiguous=False, 2025-05-07T20:33:02.3148868Z compiled=False, 2025-05-07T20:33:02.3149094Z ) 2025-05-07T20:33:02.3149430Z self = 2025-05-07T20:33:02.3149937Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:02.3150229Z 2025-05-07T20:33:02.3150311Z @given( 2025-05-07T20:33:02.3150556Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.3150881Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.3151201Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.3151549Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.3151894Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.3152187Z ) 2025-05-07T20:33:02.3152549Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.3153011Z def test_silu_mul_quant( 2025-05-07T20:33:02.3153260Z self, 2025-05-07T20:33:02.3153472Z T: int, 2025-05-07T20:33:02.3153685Z D: int, 2025-05-07T20:33:02.3153910Z scale_ub: Optional[float], 2025-05-07T20:33:02.3154194Z contiguous: bool, 2025-05-07T20:33:02.3154451Z compiled: bool, 2025-05-07T20:33:02.3154690Z ) -> None: 2025-05-07T20:33:02.3154911Z torch.manual_seed(2025) 2025-05-07T20:33:02.3155166Z 2025-05-07T20:33:02.3155462Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.3155938Z 2025-05-07T20:33:02.3156154Z x_sign = torch.sign(x) 2025-05-07T20:33:02.3156461Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.3158498Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:02.3160450Z 2025-05-07T20:33:02.3160583Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:02.3160801Z 2025-05-07T20:33:02.3160911Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.3161341Z self=, 2025-05-07T20:33:02.3161768Z T=4096, 2025-05-07T20:33:02.3161974Z D=7168, 2025-05-07T20:33:02.3162173Z scale_ub=1200.0, 2025-05-07T20:33:02.3162410Z contiguous=True, 2025-05-07T20:33:02.3162646Z compiled=True, 2025-05-07T20:33:02.3162858Z ) 2025-05-07T20:33:02.3163196Z self = 2025-05-07T20:33:02.3163700Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:02.3163976Z 2025-05-07T20:33:02.3164060Z @given( 2025-05-07T20:33:02.3164304Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.3164629Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.3164947Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.3165292Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.3165946Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.3166241Z ) 2025-05-07T20:33:02.3166764Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.3167338Z def test_silu_mul_quant( 2025-05-07T20:33:02.3167593Z self, 2025-05-07T20:33:02.3167797Z T: int, 2025-05-07T20:33:02.3168067Z D: int, 2025-05-07T20:33:02.3168298Z scale_ub: Optional[float], 2025-05-07T20:33:02.3168573Z contiguous: bool, 2025-05-07T20:33:02.3168838Z compiled: bool, 2025-05-07T20:33:02.3169117Z ) -> None: 2025-05-07T20:33:02.3169338Z torch.manual_seed(2025) 2025-05-07T20:33:02.3169600Z 2025-05-07T20:33:02.3169883Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.3170232Z 2025-05-07T20:33:02.3170440Z x_sign = torch.sign(x) 2025-05-07T20:33:02.3170740Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.3172774Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:02.3174656Z 2025-05-07T20:33:02.3174787Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:02.3175007Z 2025-05-07T20:33:02.3175116Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.3175540Z self=, 2025-05-07T20:33:02.3175960Z T=16384, 2025-05-07T20:33:02.3176165Z D=7168, 2025-05-07T20:33:02.3176370Z scale_ub=None, 2025-05-07T20:33:02.3176605Z contiguous=False, 2025-05-07T20:33:02.3176836Z compiled=False, 2025-05-07T20:33:02.3177053Z ) 2025-05-07T20:33:02.3177394Z self = 2025-05-07T20:33:02.3177907Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:02.3178194Z 2025-05-07T20:33:02.3178280Z @given( 2025-05-07T20:33:02.3178525Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.3178884Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.3179225Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.3179569Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.3179915Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.3180213Z ) 2025-05-07T20:33:02.3180576Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.3181020Z def test_silu_mul_quant( 2025-05-07T20:33:02.3181273Z self, 2025-05-07T20:33:02.3181481Z T: int, 2025-05-07T20:33:02.3181683Z D: int, 2025-05-07T20:33:02.3181917Z scale_ub: Optional[float], 2025-05-07T20:33:02.3182199Z contiguous: bool, 2025-05-07T20:33:02.3182452Z compiled: bool, 2025-05-07T20:33:02.3182680Z ) -> None: 2025-05-07T20:33:02.3182912Z torch.manual_seed(2025) 2025-05-07T20:33:02.3183171Z 2025-05-07T20:33:02.3183450Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.3185529Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:02.3187416Z 2025-05-07T20:33:02.3187661Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:02.3187878Z 2025-05-07T20:33:02.3187991Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.3188420Z self=, 2025-05-07T20:33:02.3188868Z T=2048, 2025-05-07T20:33:02.3189068Z D=7168, 2025-05-07T20:33:02.3189275Z scale_ub=1200.0, 2025-05-07T20:33:02.3189502Z contiguous=True, 2025-05-07T20:33:02.3189738Z compiled=True, 2025-05-07T20:33:02.3189948Z ) 2025-05-07T20:33:02.3190269Z self = 2025-05-07T20:33:02.3190770Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:02.3191044Z 2025-05-07T20:33:02.3191133Z @given( 2025-05-07T20:33:02.3191364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.3191686Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.3192003Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.3192346Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.3192687Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.3192983Z ) 2025-05-07T20:33:02.3193346Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.3193790Z def test_silu_mul_quant( 2025-05-07T20:33:02.3194041Z self, 2025-05-07T20:33:02.3194246Z T: int, 2025-05-07T20:33:02.3194447Z D: int, 2025-05-07T20:33:02.3194676Z scale_ub: Optional[float], 2025-05-07T20:33:02.3194955Z contiguous: bool, 2025-05-07T20:33:02.3195197Z compiled: bool, 2025-05-07T20:33:02.3195429Z ) -> None: 2025-05-07T20:33:02.3195653Z torch.manual_seed(2025) 2025-05-07T20:33:02.3195965Z 2025-05-07T20:33:02.3196251Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.3196602Z 2025-05-07T20:33:02.3196800Z x_sign = torch.sign(x) 2025-05-07T20:33:02.3197107Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.3199166Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:02.3201030Z 2025-05-07T20:33:02.3201151Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:02.3201370Z 2025-05-07T20:33:02.3201492Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.3201909Z self=, 2025-05-07T20:33:02.3202322Z T=2048, 2025-05-07T20:33:02.3202526Z D=7168, 2025-05-07T20:33:02.3202722Z scale_ub=None, 2025-05-07T20:33:02.3202945Z contiguous=True, 2025-05-07T20:33:02.3203179Z compiled=False, 2025-05-07T20:33:02.3203388Z ) 2025-05-07T20:33:02.4351187Z self = 2025-05-07T20:33:02.4351730Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:02.4352076Z 2025-05-07T20:33:02.4352192Z @given( 2025-05-07T20:33:02.4352503Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.4352816Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.4353130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.4353464Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.4353792Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.4354082Z ) 2025-05-07T20:33:02.4354745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.4355279Z def test_silu_mul_quant( 2025-05-07T20:33:02.4355521Z self, 2025-05-07T20:33:02.4355720Z T: int, 2025-05-07T20:33:02.4356012Z D: int, 2025-05-07T20:33:02.4356312Z scale_ub: Optional[float], 2025-05-07T20:33:02.4356589Z contiguous: bool, 2025-05-07T20:33:02.4356835Z compiled: bool, 2025-05-07T20:33:02.4357054Z ) -> None: 2025-05-07T20:33:02.4357275Z torch.manual_seed(2025) 2025-05-07T20:33:02.4357522Z 2025-05-07T20:33:02.4357790Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.4358138Z 2025-05-07T20:33:02.4358339Z > x_sign = torch.sign(x) 2025-05-07T20:33:02.4360298Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
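The OutOfMemoryError examples are a secondary symptom rather than a new bug: hypothesis keeps allocating large bfloat16 inputs in the same process (T=16384 with D=7168 alone is a 448 MiB tensor, matching the failed allocation above), and after enough failed examples the ~22 GiB card has almost nothing free, so even torch.randn or torch.sign cannot allocate. Besides the allocator hint the error message itself prints (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True), releasing cached blocks between examples is a plausible mitigation; a sketch, assuming it is called at the top of the test:

    import gc
    import torch

    def _free_cuda() -> None:
        gc.collect()               # drop dead Python references first
        torch.cuda.empty_cache()   # return cached blocks to the driver
        torch.cuda.synchronize()

    # Or launch the suite as:
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest moe/activation_test.py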
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:02.4362164Z 2025-05-07T20:33:02.4362289Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:02.4362499Z 2025-05-07T20:33:02.4362601Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.4363014Z self=, 2025-05-07T20:33:02.4363421Z T=1, 2025-05-07T20:33:02.4363603Z D=7168, 2025-05-07T20:33:02.4363797Z scale_ub=1200.0, 2025-05-07T20:33:02.4364021Z contiguous=True, 2025-05-07T20:33:02.4364239Z compiled=False, 2025-05-07T20:33:02.4364446Z ) 2025-05-07T20:33:02.4364765Z self = 2025-05-07T20:33:02.4365247Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:02.4365791Z 2025-05-07T20:33:02.4365872Z @given( 2025-05-07T20:33:02.4366110Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.4366431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.4366734Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.4367070Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.4367403Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.4367689Z ) 2025-05-07T20:33:02.4368039Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.4368483Z def test_silu_mul_quant( 2025-05-07T20:33:02.4368722Z self, 2025-05-07T20:33:02.4368921Z T: int, 2025-05-07T20:33:02.4369125Z D: int, 2025-05-07T20:33:02.4369341Z scale_ub: Optional[float], 2025-05-07T20:33:02.4369620Z contiguous: bool, 2025-05-07T20:33:02.4369865Z compiled: bool, 2025-05-07T20:33:02.4370091Z ) -> None: 2025-05-07T20:33:02.4370314Z torch.manual_seed(2025) 2025-05-07T20:33:02.4370559Z 2025-05-07T20:33:02.4370832Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.4371172Z 2025-05-07T20:33:02.4371368Z x_sign = torch.sign(x) 2025-05-07T20:33:02.4371657Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.4371965Z x = x_sign * x_clamp 2025-05-07T20:33:02.4372210Z x0 = x[:, :D] 2025-05-07T20:33:02.4372428Z x1 = x[:, D:] 2025-05-07T20:33:02.4372637Z 2025-05-07T20:33:02.4372829Z if contiguous: 2025-05-07T20:33:02.4373061Z x0 = x0.contiguous() 2025-05-07T20:33:02.4373322Z x1 = x1.contiguous() 2025-05-07T20:33:02.4373567Z 2025-05-07T20:33:02.4373764Z if scale_ub is not None: 2025-05-07T20:33:02.4374037Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.4374514Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.4374883Z ) 2025-05-07T20:33:02.4375074Z else: 2025-05-07T20:33:02.4375302Z scale_ub_tensor = None 2025-05-07T20:33:02.4375555Z 2025-05-07T20:33:02.4375851Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.4376164Z op = silu_mul_quant 2025-05-07T20:33:02.4376416Z if compiled: 2025-05-07T20:33:02.4376667Z op = torch.compile(op) 2025-05-07T20:33:02.4376960Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.4377241Z 2025-05-07T20:33:02.4377440Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.4377609Z 2025-05-07T20:33:02.4377720Z moe/activation_test.py:117: 2025-05-07T20:33:02.4378012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.4378349Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.4378639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.4379333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.4380040Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.4380583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.4381273Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.4381929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.4382466Z kernel = self.compile( 2025-05-07T20:33:02.4383014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.4383662Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.4384065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.4384301Z 2025-05-07T20:33:02.4384518Z self = 2025-05-07T20:33:02.4385605Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.4386977Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a692dab60>} 2025-05-07T20:33:02.4388328Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.4389361Z context = 2025-05-07T20:33:02.4389650Z 2025-05-07T20:33:02.4389833Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.4390365Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.4390831Z module_map=module_map) 2025-05-07T20:33:02.4391204Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.4391565Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.4391822Z E ^ 2025-05-07T20:33:02.4392289Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:02.4393161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:02.4393783Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> same test source and same Triton traceback as above; CompilationError in _fbgemm_silu_mul_quant ("type fp8e4nv not supported in this architecture").
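The frames above show how silu_mul_quant reaches the failure: activation.py:80 launches a @triton.jit kernel with the subscript syntax _fbgemm_silu_mul_quant[grid](...), jit.py forwards that call into JITFunction.run, and the first launch triggers compile -> make_ir, which is where the dtype check fires. A minimal, self-contained sketch of that launch pattern (a generic elementwise kernel, not the FBGEMM one):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _double_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        # One program instance per BLOCK-sized slice of the input.
        pid = tl.program_id(axis=0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(y_ptr + offs, x * 2.0, mask=mask)

    x = torch.randn(4096, device="cuda")
    y = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    # kernel[grid](...) is the launch syntax seen at activation.py:80; the body
    # is compiled to device code on first launch, where dtype checks happen.
    _double_kernel[grid](x, y, x.numel(), BLOCK=1024)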
2025-05-07T20:33:02.5115702Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> same test source and same Triton traceback as above; CompilationError in _fbgemm_silu_mul_quant ("type fp8e4nv not supported in this architecture").
2025-05-07T20:33:02.5157069Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:02.5984384Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:02.5986462Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
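fp8e4nv is Triton's name for the float8 e4m3 format behind torch.float8_e4m3fn; Triton only accepts it on GPUs with compute capability 8.9 or newer (Ada/Hopper), while e.g. an A10G reports (8, 6) and offers only fp8e4b15 and fp8e5, which matches the ValueError above. A minimal sketch of a capability gate such a test could use (the helper and decorator are hypothetical, not part of the FBGEMM test):

    import unittest
    import torch

    def fp8e4nv_supported() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) needs compute capability
        # >= 8.9; an A10G reports (8, 6), so only fp8e4b15/fp8e5 exist there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical gate -- the FBGEMM test does not carry this decorator.
    @unittest.skipUnless(fp8e4nv_supported(), "FP8 e4m3 unsupported on this GPU")
    class Fp8ActivationTests(unittest.TestCase):
        ...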
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:02.5988468Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:02.5988801Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> same test source and same Triton traceback as above; CompilationError in _fbgemm_silu_mul_quant ("type fp8e4nv not supported in this architecture").
2025-05-07T20:33:02.6020642Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:02.6029397Z >       x_sign = torch.sign(x)
  -> torch.OutOfMemoryError at moe/activation_test.py:94: tried to allocate 40.00 MiB (26.44 MiB free; 21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated).
2025-05-07T20:33:02.6033679Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:02.6808471Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 320.00 MiB (26.44 MiB free).
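The allocator message names its own mitigation: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. It has to be in place before CUDA is initialized, so in-process it must precede the first CUDA call (most simply, before importing torch), or be set in the job's environment. Note that in this run only ~26 MiB of 22.07 GiB is actually free, so reducing fragmentation alone would likely not rescue these examples. A minimal sketch:

    import os

    # Must be set before torch initializes CUDA -- simplest is before the
    # import (or in the CI job env); once the allocator is live it is ignored.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402

    x = torch.randn(1024, device="cuda")  # allocator now uses expandable segments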
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], ...)): tried to allocate 80.00 MiB (26.44 MiB free).
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 40.00 MiB (26.44 MiB free).
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 112.00 MiB (26.44 MiB free).
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 40.00 MiB (26.44 MiB free).
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 112.00 MiB (26.44 MiB free).
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 448.00 MiB (26.44 MiB free).
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 112.00 MiB (26.44 MiB free).
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 448.00 MiB (26.44 MiB free).
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 448.00 MiB (26.44 MiB free).
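The requested sizes line up with the test's input shape: x is [T, 2*D] in bfloat16 (2 bytes per element), so the torch.randn at line 92 asks for 4*T*D bytes; the follow-up lines 94-95 (torch.sign, torch.abs/torch.clamp) each materialize another tensor of the same size, which is why some examples fail there instead of at the initial allocation. A quick check against the logged numbers:

    def randn_alloc_mib(T: int, D: int) -> float:
        # x = torch.randn([T, 2 * D], dtype=torch.bfloat16): 2 bytes/element.
        return T * (2 * D) * 2 / 2**20

    assert randn_alloc_mib(2048, 7168) == 56.0    # "Tried to allocate 56.00 MiB"
    assert randn_alloc_mib(4096, 5120) == 80.0    # "Tried to allocate 80.00 MiB"
    assert randn_alloc_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"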
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
  -> same test source and same Triton traceback as above; CompilationError in _fbgemm_silu_mul_quant ("type fp8e4nv not supported in this architecture").
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 56.00 MiB (26.44 MiB free; 21.74 GiB allocated by PyTorch, 10.99 MiB reserved but unallocated).
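The per-example dumps flooding this log come from the test's @settings(verbosity=Verbosity.verbose, ...); at the default verbosity Hypothesis prints only the final minimal failures. A small sketch of the quieter configuration (max_examples=10 is an arbitrary stand-in for the test's _MAX_SAMPLES):

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    @settings(verbosity=Verbosity.normal, max_examples=10, deadline=None)
    def test_quiet(T: int) -> None:
        # At Verbosity.normal the "Trying example: ..." lines are omitted;
        # only minimal reproducing examples are reported on failure.
        assert T >= 1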
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
  -> same CompilationError as above ("type fp8e4nv not supported in this architecture"); the compiled=True path goes through torch/_dynamo/eval_frame.py but reaches the same Triton kernel.
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): tried to allocate 20.00 MiB (4.44 MiB free; 21.77 GiB allocated by PyTorch, 6.37 MiB reserved but unallocated).
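Free memory shrinks as the run proceeds (26.44 MiB free in the earlier examples, 4.44 MiB by this point), i.e. allocations from failed examples accumulate across Hypothesis examples. A hedged cleanup sketch that a test could run between examples; it only reclaims memory from tensors that are no longer referenced:

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()                 # drop unreferenced Python-side tensors first
        torch.cuda.empty_cache()     # return cached, unused blocks to the driver
        torch.cuda.synchronize()     # make sure pending frees have completed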
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:02.9718690Z 2025-05-07T20:33:02.9718810Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:02.9719024Z 2025-05-07T20:33:02.9719134Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.9719587Z self=, 2025-05-07T20:33:02.9719990Z T=128, 2025-05-07T20:33:02.9720173Z D=5120, 2025-05-07T20:33:02.9720358Z scale_ub=1200.0, 2025-05-07T20:33:02.9720576Z contiguous=True, 2025-05-07T20:33:02.9720799Z compiled=True, 2025-05-07T20:33:02.9721010Z ) 2025-05-07T20:33:02.9721344Z self = 2025-05-07T20:33:02.9722002Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:02.9722329Z 2025-05-07T20:33:02.9722412Z @given( 2025-05-07T20:33:02.9722652Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.9723020Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.9723329Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.9723667Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.9723999Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.9724278Z ) 2025-05-07T20:33:02.9724628Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.9725087Z def test_silu_mul_quant( 2025-05-07T20:33:02.9725331Z self, 2025-05-07T20:33:02.9725537Z T: int, 2025-05-07T20:33:02.9725744Z D: int, 2025-05-07T20:33:02.9725963Z scale_ub: Optional[float], 2025-05-07T20:33:02.9726243Z contiguous: bool, 2025-05-07T20:33:02.9726504Z compiled: bool, 2025-05-07T20:33:02.9726737Z ) -> None: 2025-05-07T20:33:02.9726955Z torch.manual_seed(2025) 2025-05-07T20:33:02.9727206Z 2025-05-07T20:33:02.9727487Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.9727833Z 2025-05-07T20:33:02.9728035Z x_sign = torch.sign(x) 2025-05-07T20:33:02.9728331Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.9730384Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:02.9732251Z 2025-05-07T20:33:02.9732375Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:02.9732598Z 2025-05-07T20:33:02.9732705Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.9733127Z self=, 2025-05-07T20:33:02.9733536Z T=128, 2025-05-07T20:33:02.9733725Z D=7168, 2025-05-07T20:33:02.9733925Z scale_ub=None, 2025-05-07T20:33:02.9734146Z contiguous=True, 2025-05-07T20:33:02.9734368Z compiled=True, 2025-05-07T20:33:02.9734577Z ) 2025-05-07T20:33:03.4556922Z self = 2025-05-07T20:33:03.4557654Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:03.4557928Z 2025-05-07T20:33:03.4558011Z @given( 2025-05-07T20:33:03.4558256Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4558579Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.4558925Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.4559255Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.4559637Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.4559933Z ) 2025-05-07T20:33:03.4560279Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.4560725Z def test_silu_mul_quant( 2025-05-07T20:33:03.4560974Z self, 2025-05-07T20:33:03.4561174Z T: int, 2025-05-07T20:33:03.4561380Z D: int, 2025-05-07T20:33:03.4561608Z scale_ub: Optional[float], 2025-05-07T20:33:03.4561880Z contiguous: bool, 2025-05-07T20:33:03.4562127Z compiled: bool, 2025-05-07T20:33:03.4562359Z ) -> None: 2025-05-07T20:33:03.4562574Z torch.manual_seed(2025) 2025-05-07T20:33:03.4562821Z 2025-05-07T20:33:03.4563098Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4565807Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
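The error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce fragmentation. A sketch of applying it from Python rather than the shell; the env var is the documented one, the placement is ours:

    import os

    # The allocator reads this variable once, at the first CUDA allocation,
    # so it must be in place before any tensor touches the GPU.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported only after the env var is set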
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:03.4567862Z 2025-05-07T20:33:03.4567984Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:03.4568196Z 2025-05-07T20:33:03.4601789Z FAILED 2025-05-07T20:33:03.4601967Z 2025-05-07T20:33:03.4602405Z =================================== FAILURES =================================== 2025-05-07T20:33:03.4603040Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:03.4603686Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:03.4604530Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:03.4605304Z | yield 2025-05-07T20:33:03.4605908Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:33:03.4606625Z | self._callTestMethod(testMethod) 2025-05-07T20:33:03.4607390Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:33:03.4608158Z | if method() is not None: 2025-05-07T20:33:03.4608518Z | ^^^^^^^^ 2025-05-07T20:33:03.4609406Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:03.4610431Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4610853Z | ^^^^^^^ 2025-05-07T20:33:03.4611622Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:03.4612478Z | raise the_error_hypothesis_found 2025-05-07T20:33:03.4613059Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:03.4613638Z +-+---------------- 1 ---------------- 2025-05-07T20:33:03.4614038Z | Traceback (most recent call last): 2025-05-07T20:33:03.4615013Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:03.4616095Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4616611Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:03.4619787Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
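The runner reports the four sub-failures as a single ExceptionGroup, which is how Hypothesis surfaces multiple distinct falsifying examples under Python 3.12's unittest. A self-contained sketch of splitting such a group with except*; the exceptions here are stand-ins, not FBGEMM code:

    # Illustrative only: splitting a multi-failure ExceptionGroup by type.
    def demo() -> None:
        raise ExceptionGroup(
            "Hypothesis found 4 distinct failures.",
            [MemoryError("CUDA out of memory"),
             ValueError("type fp8e4nv not supported in this architecture")],
        )

    try:
        demo()
    except* MemoryError as group:
        print(f"{len(group.exceptions)} OOM-style failure(s)")
    except* ValueError as group:
        print(f"{len(group.exceptions)} compile-style failure(s)")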
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:03.4622536Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:03.4623135Z | self=, 2025-05-07T20:33:03.4623692Z | T=2048, 2025-05-07T20:33:03.4624014Z | D=5120, # or any other generated value 2025-05-07T20:33:03.4624478Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:03.4624968Z | contiguous=True, # or any other generated value 2025-05-07T20:33:03.4625479Z | compiled=False, # or any other generated value 2025-05-07T20:33:03.4626145Z | ) 2025-05-07T20:33:03.4626404Z | 2025-05-07T20:33:03.4627118Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:03.4628008Z +---------------- 2 ---------------- 2025-05-07T20:33:03.4628406Z | Traceback (most recent call last): 2025-05-07T20:33:03.4629378Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:03.4630493Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4631003Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:03.4633649Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:03.4635639Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:03.4636155Z | self=, 2025-05-07T20:33:03.4636569Z | T=128, 2025-05-07T20:33:03.4636776Z | D=7168, 2025-05-07T20:33:03.4636982Z | scale_ub=None, 2025-05-07T20:33:03.4637225Z | contiguous=True, 2025-05-07T20:33:03.4637475Z | compiled=True, 2025-05-07T20:33:03.4637705Z | ) 2025-05-07T20:33:03.4637885Z | 2025-05-07T20:33:03.4638420Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:03.4639029Z +---------------- 3 ---------------- 2025-05-07T20:33:03.4639316Z | Traceback (most recent call last): 2025-05-07T20:33:03.4640022Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:03.4640805Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4641183Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:03.4643162Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
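Each falsifying example above comes with a replay handle. As the log says, pinning the test to one example is a temporary one-line change; the sketch below copies the blob verbatim from failure 1 and trims the strategies to a single parameter for brevity:

    from hypothesis import given, reproduce_failure, strategies as st

    # The blob is only meaningful for the Hypothesis version that
    # emitted it (6.131.14); remove the decorator after debugging.
    @reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")
    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    def test_silu_mul_quant(T: int) -> None:
        ...  # existing test body, unchanged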
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:03.4645127Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:03.4645573Z | self=, 2025-05-07T20:33:03.4645982Z | T=128, 2025-05-07T20:33:03.4646183Z | D=5120, 2025-05-07T20:33:03.4646396Z | scale_ub=1200.0, 2025-05-07T20:33:03.4646639Z | contiguous=True, 2025-05-07T20:33:03.4646877Z | compiled=True, 2025-05-07T20:33:03.4647103Z | ) 2025-05-07T20:33:03.4647288Z | 2025-05-07T20:33:03.4647811Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:03.4648413Z +---------------- 4 ---------------- 2025-05-07T20:33:03.4648703Z | Traceback (most recent call last): 2025-05-07T20:33:03.4649504Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:03.4650246Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:03.4650577Z | ^^^^^^^^ 2025-05-07T20:33:03.4651215Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:03.4651911Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.4652247Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:03.4653042Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:03.4653836Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:03.4654443Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:03.4655179Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.4655625Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:03.4656268Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:03.4657036Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:03.4657513Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:03.4658152Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:03.4658878Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:03.4659399Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:03.4660263Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:03.4661035Z | fn() 2025-05-07T20:33:03.4661801Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:03.4662656Z | self.fn.run( 2025-05-07T20:33:03.4663380Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:03.4664185Z | kernel = self.compile( 2025-05-07T20:33:03.4664545Z | ^^^^^^^^^^^^^ 2025-05-07T20:33:03.4665776Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:03.4666754Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.4667302Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:03.4668173Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:03.4669258Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.4669924Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:03.4670453Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.4670942Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:03.4671327Z | ^ 2025-05-07T20:33:03.4671979Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.4672777Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:03.4673341Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:03.4674354Z | self=, 2025-05-07T20:33:03.4674956Z | T=1, # or any other generated value 2025-05-07T20:33:03.4675473Z | D=5120, # or any other generated value 2025-05-07T20:33:03.4676070Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:03.4676577Z | contiguous=True, # or any other generated value 2025-05-07T20:33:03.4677078Z | compiled=True, # or any other generated value 2025-05-07T20:33:03.4677502Z | ) 2025-05-07T20:33:03.4677764Z | 2025-05-07T20:33:03.4678480Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:03.4679322Z +------------------------------------ 2025-05-07T20:33:03.4679820Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:03.4680346Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.4680928Z self=, 2025-05-07T20:33:03.4681486Z T=1, 2025-05-07T20:33:03.4681754Z D=5120, 2025-05-07T20:33:03.4682023Z scale_ub=None, 2025-05-07T20:33:03.4682324Z contiguous=True, 2025-05-07T20:33:03.4682639Z compiled=True, 2025-05-07T20:33:03.4682924Z ) 2025-05-07T20:33:03.4683364Z self = 2025-05-07T20:33:03.4684037Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:03.4684404Z 2025-05-07T20:33:03.4684517Z @given( 2025-05-07T20:33:03.4684841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4685284Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.4685705Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.4686166Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.4686633Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.4687049Z ) 2025-05-07T20:33:03.4687531Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.4688151Z def test_silu_mul_quant( 2025-05-07T20:33:03.4688499Z self, 2025-05-07T20:33:03.4688762Z T: int, 2025-05-07T20:33:03.4689037Z D: int, 2025-05-07T20:33:03.4689355Z scale_ub: Optional[float], 2025-05-07T20:33:03.4689769Z contiguous: bool, 2025-05-07T20:33:03.4690101Z compiled: bool, 2025-05-07T20:33:03.4690405Z ) -> None: 2025-05-07T20:33:03.4690699Z torch.manual_seed(2025) 2025-05-07T20:33:03.4691042Z 2025-05-07T20:33:03.4691425Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4691895Z 2025-05-07T20:33:03.4692177Z x_sign = torch.sign(x) 2025-05-07T20:33:03.4692582Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.4693023Z x = x_sign * x_clamp 2025-05-07T20:33:03.4693368Z x0 = x[:, :D] 2025-05-07T20:33:03.4693690Z x1 = x[:, D:] 2025-05-07T20:33:03.4694001Z 2025-05-07T20:33:03.4694262Z if contiguous: 2025-05-07T20:33:03.4694594Z x0 = x0.contiguous() 2025-05-07T20:33:03.4694970Z x1 = x1.contiguous() 2025-05-07T20:33:03.4695315Z 2025-05-07T20:33:03.4695594Z if scale_ub is not None: 2025-05-07T20:33:03.4695973Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.4696431Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.4696867Z ) 2025-05-07T20:33:03.4697143Z else: 2025-05-07T20:33:03.4697438Z scale_ub_tensor = None 2025-05-07T20:33:03.4697804Z 2025-05-07T20:33:03.4698132Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4698572Z op = silu_mul_quant 2025-05-07T20:33:03.4698916Z if compiled: 2025-05-07T20:33:03.4699262Z op = torch.compile(op) 2025-05-07T20:33:03.4699780Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4700209Z 2025-05-07T20:33:03.4700490Z y_fp8, y_scale = fn() 2025-05-07T20:33:03.4700874Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:03.4701340Z 2025-05-07T20:33:03.4701673Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4702137Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:03.4702545Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:03.4702987Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:03.4703473Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.4703896Z 2025-05-07T20:33:03.4704178Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:03.4704445Z 2025-05-07T20:33:03.4704590Z moe/activation_test.py:126: 2025-05-07T20:33:03.4705002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4705481Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:03.4705948Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.4707031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:03.4708056Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:03.4708790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.4709689Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.4710595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:03.4711557Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:03.4712523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:03.4713369Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:03.4714159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:03.4714852Z fn() 2025-05-07T20:33:03.4715530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:03.4716433Z self.fn.run( 2025-05-07T20:33:03.4717078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.4717825Z kernel = self.compile( 2025-05-07T20:33:03.4718546Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.4719446Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.4720015Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4720340Z 2025-05-07T20:33:03.4720629Z self = 2025-05-07T20:33:03.4722064Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.4723885Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b91f50c20>} 2025-05-07T20:33:03.4725646Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.4726992Z context = 2025-05-07T20:33:03.4727370Z 2025-05-07T20:33:03.4727700Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.4728466Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.4729080Z module_map=module_map) 2025-05-07T20:33:03.4729607Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.4730077Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:03.4730427Z E ^ 2025-05-07T20:33:03.4731055Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.4731645Z 2025-05-07T20:33:03.4732199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.4732865Z 2025-05-07T20:33:03.4733014Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.4733552Z self=, 2025-05-07T20:33:03.4734084Z T=2048, 2025-05-07T20:33:03.4734352Z D=5120, 2025-05-07T20:33:03.4734606Z scale_ub=1200.0, 2025-05-07T20:33:03.4734900Z contiguous=True, 2025-05-07T20:33:03.4735195Z compiled=False, 2025-05-07T20:33:03.4735465Z ) 2025-05-07T20:33:03.4735922Z self = 2025-05-07T20:33:03.4736601Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:03.4736974Z 2025-05-07T20:33:03.4737089Z @given( 2025-05-07T20:33:03.4737400Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4737835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.4738274Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.4738741Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.4739209Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.4739603Z ) 2025-05-07T20:33:03.4740080Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.4740701Z def test_silu_mul_quant( 2025-05-07T20:33:03.4741032Z self, 2025-05-07T20:33:03.4741294Z T: int, 2025-05-07T20:33:03.4741585Z D: int, 2025-05-07T20:33:03.4763039Z scale_ub: Optional[float], 2025-05-07T20:33:03.4763427Z contiguous: bool, 2025-05-07T20:33:03.4763759Z compiled: bool, 2025-05-07T20:33:03.4764070Z ) -> None: 2025-05-07T20:33:03.4764351Z torch.manual_seed(2025) 2025-05-07T20:33:03.4764681Z 2025-05-07T20:33:03.4765055Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4765769Z 2025-05-07T20:33:03.4766034Z x_sign = torch.sign(x) 2025-05-07T20:33:03.4766429Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.4766845Z x = x_sign * x_clamp 2025-05-07T20:33:03.4767177Z x0 = x[:, :D] 
2025-05-07T20:33:03.4767479Z x1 = x[:, D:] 2025-05-07T20:33:03.4767752Z 2025-05-07T20:33:03.4768014Z if contiguous: 2025-05-07T20:33:03.4768336Z x0 = x0.contiguous() 2025-05-07T20:33:03.4768676Z x1 = x1.contiguous() 2025-05-07T20:33:03.4768995Z 2025-05-07T20:33:03.4769255Z if scale_ub is not None: 2025-05-07T20:33:03.4769619Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.4770074Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.4770506Z ) 2025-05-07T20:33:03.4770767Z else: 2025-05-07T20:33:03.4771061Z scale_ub_tensor = None 2025-05-07T20:33:03.4771404Z 2025-05-07T20:33:03.4771710Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4772131Z op = silu_mul_quant 2025-05-07T20:33:03.4772477Z if compiled: 2025-05-07T20:33:03.4772811Z op = torch.compile(op) 2025-05-07T20:33:03.4773208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4773576Z 2025-05-07T20:33:03.4773830Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.4774421Z 2025-05-07T20:33:03.4774565Z moe/activation_test.py:117: 2025-05-07T20:33:03.4774974Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4775423Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.4775877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4776785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.4777723Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.4778464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.4779412Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.4780331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.4781078Z kernel = self.compile( 2025-05-07T20:33:03.4781831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.4782754Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.4783322Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4783626Z 2025-05-07T20:33:03.4783895Z self = 2025-05-07T20:33:03.4785286Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.4787083Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b91e10180>} 2025-05-07T20:33:03.4788825Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.4790203Z context = 2025-05-07T20:33:03.4790574Z 2025-05-07T20:33:03.4790788Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.4791465Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.4792064Z module_map=module_map) 2025-05-07T20:33:03.4792529Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.4792984Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.4793343Z E ^ 2025-05-07T20:33:03.4793948Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.4794546Z 2025-05-07T20:33:03.4795101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.4795884Z 2025-05-07T20:33:03.4796022Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.4796557Z self=, 2025-05-07T20:33:03.4797093Z T=2048, 2025-05-07T20:33:03.4797339Z D=5120, 2025-05-07T20:33:03.4797597Z scale_ub=1200.0, 2025-05-07T20:33:03.4797896Z contiguous=True, 2025-05-07T20:33:03.4798215Z compiled=True, 2025-05-07T20:33:03.4798521Z ) 2025-05-07T20:33:03.4798950Z self = 2025-05-07T20:33:03.4799601Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:03.4799978Z 2025-05-07T20:33:03.4800093Z @given( 2025-05-07T20:33:03.4800424Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4800861Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.4801379Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.4801874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.4802344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.4802772Z ) 2025-05-07T20:33:03.4803249Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.4803853Z def test_silu_mul_quant( 2025-05-07T20:33:03.4804176Z self, 2025-05-07T20:33:03.4804455Z T: int, 2025-05-07T20:33:03.4804726Z D: int, 2025-05-07T20:33:03.4805022Z scale_ub: Optional[float], 2025-05-07T20:33:03.4805390Z contiguous: bool, 2025-05-07T20:33:03.4805723Z compiled: bool, 2025-05-07T20:33:03.4806037Z ) -> None: 2025-05-07T20:33:03.4806329Z torch.manual_seed(2025) 2025-05-07T20:33:03.4806660Z 2025-05-07T20:33:03.4807035Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4807508Z 2025-05-07T20:33:03.4807791Z x_sign = torch.sign(x) 2025-05-07T20:33:03.4808197Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.4808605Z x = x_sign * x_clamp 2025-05-07T20:33:03.4808926Z x0 = x[:, :D] 2025-05-07T20:33:03.4809236Z x1 = x[:, D:] 2025-05-07T20:33:03.4809521Z 2025-05-07T20:33:03.4809772Z if contiguous: 2025-05-07T20:33:03.4810084Z x0 = x0.contiguous() 2025-05-07T20:33:03.4810440Z x1 = x1.contiguous() 2025-05-07T20:33:03.4810775Z 2025-05-07T20:33:03.4811040Z if scale_ub is not None: 2025-05-07T20:33:03.4811418Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.4811875Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.4812298Z ) 2025-05-07T20:33:03.4812556Z else: 2025-05-07T20:33:03.4812846Z scale_ub_tensor = None 2025-05-07T20:33:03.4813190Z 2025-05-07T20:33:03.4813509Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4813954Z op = silu_mul_quant 2025-05-07T20:33:03.4814297Z if compiled: 2025-05-07T20:33:03.4814628Z op = torch.compile(op) 2025-05-07T20:33:03.4815031Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4815395Z 2025-05-07T20:33:03.4815664Z y_fp8, y_scale = fn() 2025-05-07T20:33:03.4816060Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:03.4816462Z 2025-05-07T20:33:03.4816795Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4817258Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:03.4817670Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:03.4818103Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:03.4818608Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.4819041Z 2025-05-07T20:33:03.4819330Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:03.4819609Z 2025-05-07T20:33:03.4819765Z moe/activation_test.py:126: 2025-05-07T20:33:03.4820165Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4820628Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:03.4821089Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.4822132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:03.4823123Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:03.4823851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.4824773Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.4825684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:03.4826762Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:03.4827786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:03.4828642Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:03.4829502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:03.4830207Z fn() 2025-05-07T20:33:03.4830881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:03.4831665Z self.fn.run( 2025-05-07T20:33:03.4832297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.4833023Z kernel = self.compile( 2025-05-07T20:33:03.4833769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.4834676Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.4835210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4835512Z 2025-05-07T20:33:03.4835826Z self = 2025-05-07T20:33:03.4836907Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.4838278Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b906de840>} 2025-05-07T20:33:03.4839615Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.4840643Z context = 2025-05-07T20:33:03.4840934Z 2025-05-07T20:33:03.4841099Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.4841623Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.4842092Z module_map=module_map) 2025-05-07T20:33:03.4842451Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.4842806Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:03.4843077Z E ^ 2025-05-07T20:33:03.4843541Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.4843991Z 2025-05-07T20:33:03.4844404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.4844919Z 2025-05-07T20:33:03.4845029Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.4845450Z self=, 2025-05-07T20:33:03.4845851Z T=16384, 2025-05-07T20:33:03.4846055Z D=7168, 2025-05-07T20:33:03.4846257Z scale_ub=1200.0, 2025-05-07T20:33:03.4846482Z contiguous=False, 2025-05-07T20:33:03.4846700Z compiled=False, 2025-05-07T20:33:03.4846910Z ) 2025-05-07T20:33:03.4847230Z self = 2025-05-07T20:33:03.4847728Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:03.4848010Z 2025-05-07T20:33:03.4848092Z @given( 2025-05-07T20:33:03.4848324Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4848634Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.4848941Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.4849277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.4849779Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.4850071Z ) 2025-05-07T20:33:03.4850420Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.4850903Z def test_silu_mul_quant( 2025-05-07T20:33:03.4851145Z self, 2025-05-07T20:33:03.4851343Z T: int, 2025-05-07T20:33:03.4851543Z D: int, 2025-05-07T20:33:03.4851756Z scale_ub: Optional[float], 2025-05-07T20:33:03.4852029Z contiguous: bool, 2025-05-07T20:33:03.4852271Z compiled: bool, 2025-05-07T20:33:03.4852488Z ) -> None: 2025-05-07T20:33:03.4852710Z torch.manual_seed(2025) 2025-05-07T20:33:03.4852954Z 2025-05-07T20:33:03.4853220Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4853568Z 2025-05-07T20:33:03.4853774Z x_sign = torch.sign(x) 2025-05-07T20:33:03.4854058Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.4854376Z x = x_sign * x_clamp 2025-05-07T20:33:03.4854625Z x0 = x[:, :D] 2025-05-07T20:33:03.4854840Z x1 = x[:, D:] 2025-05-07T20:33:03.4855056Z 2025-05-07T20:33:03.4855244Z if contiguous: 2025-05-07T20:33:03.4855481Z x0 = x0.contiguous() 2025-05-07T20:33:03.4855738Z x1 = x1.contiguous() 2025-05-07T20:33:03.4855983Z 2025-05-07T20:33:03.4856180Z if scale_ub is not None: 2025-05-07T20:33:03.4856448Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.4856787Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.4857104Z ) 2025-05-07T20:33:03.4857292Z else: 2025-05-07T20:33:03.4857505Z scale_ub_tensor = None 2025-05-07T20:33:03.4857756Z 2025-05-07T20:33:03.4857984Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4858304Z op = silu_mul_quant 2025-05-07T20:33:03.4858559Z if compiled: 2025-05-07T20:33:03.4858807Z op = torch.compile(op) 2025-05-07T20:33:03.4859104Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4859378Z 2025-05-07T20:33:03.4859573Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.4859741Z 2025-05-07T20:33:03.4859844Z moe/activation_test.py:117: 2025-05-07T20:33:03.4860143Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4860479Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.4860760Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4861451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:03.4862137Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.4862665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.4863347Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.4864021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.4864553Z kernel = self.compile( 2025-05-07T20:33:03.4865086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.4866107Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.4866510Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4866740Z 2025-05-07T20:33:03.4866952Z self = 2025-05-07T20:33:03.4868029Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.4869625Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b908cd260>} 2025-05-07T20:33:03.4871027Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.4872117Z context = 2025-05-07T20:33:03.4872404Z 2025-05-07T20:33:03.4872570Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.4873094Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.4873568Z module_map=module_map) 2025-05-07T20:33:03.4873940Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.4874293Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.4874559Z E ^ 2025-05-07T20:33:03.4875030Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.4875484Z 2025-05-07T20:33:03.4875960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.4876484Z 2025-05-07T20:33:03.4876590Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.4877007Z self=, 2025-05-07T20:33:03.4877412Z T=1, 2025-05-07T20:33:03.4877593Z D=7168, 2025-05-07T20:33:03.4877789Z scale_ub=None, 2025-05-07T20:33:03.4878007Z contiguous=True, 2025-05-07T20:33:03.4878224Z compiled=True, 2025-05-07T20:33:03.4878430Z ) 2025-05-07T20:33:03.4878755Z self = 2025-05-07T20:33:03.4879232Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:03.4879495Z 2025-05-07T20:33:03.4879575Z @given( 2025-05-07T20:33:03.4879815Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4880131Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.4880434Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.4880771Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.4881103Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.4881383Z ) 2025-05-07T20:33:03.4881733Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.4882178Z def test_silu_mul_quant( 2025-05-07T20:33:03.4882417Z self, 2025-05-07T20:33:03.4882614Z T: int, 2025-05-07T20:33:03.4882819Z D: int, 2025-05-07T20:33:03.4883033Z scale_ub: Optional[float], 2025-05-07T20:33:03.4883305Z contiguous: bool, 2025-05-07T20:33:03.4883550Z compiled: bool, 2025-05-07T20:33:03.4883768Z ) -> None: 2025-05-07T20:33:03.4883988Z torch.manual_seed(2025) 2025-05-07T20:33:03.4884237Z 2025-05-07T20:33:03.4884517Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4884855Z 2025-05-07T20:33:03.4885053Z x_sign = torch.sign(x) 2025-05-07T20:33:03.4885355Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.4885662Z x = x_sign * x_clamp 2025-05-07T20:33:03.4885911Z x0 = x[:, :D] 2025-05-07T20:33:03.4886132Z x1 = x[:, D:] 2025-05-07T20:33:03.4886335Z 2025-05-07T20:33:03.4886525Z if contiguous: 2025-05-07T20:33:03.4886759Z x0 = x0.contiguous() 2025-05-07T20:33:03.4887017Z x1 = x1.contiguous() 2025-05-07T20:33:03.4887264Z 2025-05-07T20:33:03.4887465Z if scale_ub is not None: 2025-05-07T20:33:03.4887742Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.4888086Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.4888401Z ) 2025-05-07T20:33:03.4888590Z else: 2025-05-07T20:33:03.4888938Z scale_ub_tensor = None 2025-05-07T20:33:03.4889198Z 2025-05-07T20:33:03.4889434Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4889750Z op = silu_mul_quant 2025-05-07T20:33:03.4890046Z if compiled: 2025-05-07T20:33:03.4890294Z op = torch.compile(op) 2025-05-07T20:33:03.4890585Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4890864Z 2025-05-07T20:33:03.4891064Z y_fp8, y_scale = fn() 2025-05-07T20:33:03.4891344Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:03.4891637Z 2025-05-07T20:33:03.4891878Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4892214Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:03.4892512Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:03.4892829Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:03.4893188Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.4893506Z 2025-05-07T20:33:03.4893713Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:03.4893909Z 2025-05-07T20:33:03.4894016Z moe/activation_test.py:126: 2025-05-07T20:33:03.4894311Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4894650Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:03.4894979Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.4895757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:03.4896509Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:03.4897054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.4897732Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.4898426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:03.4899153Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:03.4899889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:03.4900535Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:03.4901133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:03.4901652Z fn() 2025-05-07T20:33:03.4902160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:03.4902736Z self.fn.run( 2025-05-07T20:33:03.4903201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.4903732Z kernel = self.compile( 2025-05-07T20:33:03.4904275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.4904926Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.4905335Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4905565Z 2025-05-07T20:33:03.4905779Z self = 2025-05-07T20:33:03.4906861Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.4908234Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b67ecaac0>} 2025-05-07T20:33:03.4909675Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.4910743Z context = 2025-05-07T20:33:03.4911070Z 2025-05-07T20:33:03.4911244Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.4911764Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.4912237Z module_map=module_map) 2025-05-07T20:33:03.4912606Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.4912963Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:03.4913234Z E ^ 2025-05-07T20:33:03.4913701Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.4914151Z 2025-05-07T20:33:03.4914575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.4915086Z 2025-05-07T20:33:03.4915193Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.4915611Z self=, 2025-05-07T20:33:03.4916077Z T=4096, 2025-05-07T20:33:03.4916267Z D=5120, 2025-05-07T20:33:03.4916465Z scale_ub=None, 2025-05-07T20:33:03.4916689Z contiguous=False, 2025-05-07T20:33:03.4916924Z compiled=False, 2025-05-07T20:33:03.4917127Z ) 2025-05-07T20:33:03.4917453Z self = 2025-05-07T20:33:03.4917953Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:03.4918230Z 2025-05-07T20:33:03.4918311Z @given( 2025-05-07T20:33:03.4918550Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4918871Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.4919186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.4919520Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.4919858Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.4920144Z ) 2025-05-07T20:33:03.4920498Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.4920945Z def test_silu_mul_quant( 2025-05-07T20:33:03.4921195Z self, 2025-05-07T20:33:03.4921388Z T: int, 2025-05-07T20:33:03.4921592Z D: int, 2025-05-07T20:33:03.4921814Z scale_ub: Optional[float], 2025-05-07T20:33:03.4922079Z contiguous: bool, 2025-05-07T20:33:03.4922329Z compiled: bool, 2025-05-07T20:33:03.4922555Z ) -> None: 2025-05-07T20:33:03.4922769Z torch.manual_seed(2025) 2025-05-07T20:33:03.4923013Z 2025-05-07T20:33:03.4923285Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4923623Z 2025-05-07T20:33:03.4923826Z x_sign = torch.sign(x) 2025-05-07T20:33:03.4924117Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.4924422Z x = x_sign * x_clamp 2025-05-07T20:33:03.4924673Z x0 = x[:, :D] 2025-05-07T20:33:03.4924894Z x1 = x[:, D:] 2025-05-07T20:33:03.4925098Z 2025-05-07T20:33:03.4925287Z if contiguous: 2025-05-07T20:33:03.4925523Z x0 = x0.contiguous() 2025-05-07T20:33:03.4925783Z x1 = x1.contiguous() 2025-05-07T20:33:03.4926018Z 2025-05-07T20:33:03.4926216Z if scale_ub is not None: 2025-05-07T20:33:03.4926491Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.4926822Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.4927133Z ) 2025-05-07T20:33:03.4927327Z else: 2025-05-07T20:33:03.4935473Z scale_ub_tensor = None 2025-05-07T20:33:03.4935776Z 2025-05-07T20:33:03.4936018Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4936540Z op = silu_mul_quant 2025-05-07T20:33:03.4936809Z if compiled: 2025-05-07T20:33:03.4937070Z op = torch.compile(op) 2025-05-07T20:33:03.4937412Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4937698Z 2025-05-07T20:33:03.4937904Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.4938071Z 2025-05-07T20:33:03.4938173Z moe/activation_test.py:117: 2025-05-07T20:33:03.4938476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4938817Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.4939100Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4939800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.4940497Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.4941038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.4941734Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.4942396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.4942943Z kernel = self.compile( 2025-05-07T20:33:03.4943327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.4943512Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.4943645Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4943650Z 2025-05-07T20:33:03.4943864Z self = 2025-05-07T20:33:03.4944651Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.4945160Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b908cdee0>} 2025-05-07T20:33:03.4945918Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.4946112Z context = 2025-05-07T20:33:03.4946116Z 2025-05-07T20:33:03.4946293Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.4946560Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.4946673Z module_map=module_map) 2025-05-07T20:33:03.4946853Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.4946958Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.4947047Z E ^ 2025-05-07T20:33:03.4947406Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.4947413Z 2025-05-07T20:33:03.4947829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.4947834Z 2025-05-07T20:33:03.4947948Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.4948175Z self=, 2025-05-07T20:33:03.4948265Z T=4096, 2025-05-07T20:33:03.4948346Z D=7168, 2025-05-07T20:33:03.4948431Z scale_ub=None, 2025-05-07T20:33:03.4948530Z contiguous=False, 2025-05-07T20:33:03.4948618Z compiled=False, 2025-05-07T20:33:03.4948697Z ) 2025-05-07T20:33:03.4948928Z self = 2025-05-07T20:33:03.4949235Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:03.4949241Z 2025-05-07T20:33:03.4949323Z @given( 2025-05-07T20:33:03.4949456Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4949600Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.4949730Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.4949851Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.4949970Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.4950054Z ) 2025-05-07T20:33:03.4950299Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.4950394Z def test_silu_mul_quant( 2025-05-07T20:33:03.4950482Z self, 2025-05-07T20:33:03.4950561Z T: int, 2025-05-07T20:33:03.4950640Z D: int, 2025-05-07T20:33:03.4950750Z scale_ub: Optional[float], 2025-05-07T20:33:03.4950842Z contiguous: bool, 2025-05-07T20:33:03.4950936Z compiled: bool, 2025-05-07T20:33:03.4951028Z ) -> None: 2025-05-07T20:33:03.4951125Z torch.manual_seed(2025) 2025-05-07T20:33:03.4951209Z 2025-05-07T20:33:03.4951383Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4951458Z 2025-05-07T20:33:03.4951565Z x_sign = torch.sign(x) 2025-05-07T20:33:03.4951693Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.4951784Z x = x_sign * x_clamp 2025-05-07T20:33:03.4951878Z x0 = x[:, :D] 2025-05-07T20:33:03.4951960Z x1 = x[:, D:] 2025-05-07T20:33:03.4952035Z 2025-05-07T20:33:03.4952128Z if contiguous: 2025-05-07T20:33:03.4952225Z x0 = x0.contiguous() 2025-05-07T20:33:03.4952320Z x1 = x1.contiguous() 2025-05-07T20:33:03.4952403Z 2025-05-07T20:33:03.4952494Z if scale_ub is not None: 2025-05-07T20:33:03.4952603Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.4952753Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.4952835Z ) 2025-05-07T20:33:03.4952920Z else: 2025-05-07T20:33:03.4953018Z scale_ub_tensor = None 2025-05-07T20:33:03.4953096Z 2025-05-07T20:33:03.4953234Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4953325Z op = silu_mul_quant 2025-05-07T20:33:03.4953411Z if compiled: 2025-05-07T20:33:03.4953520Z op = torch.compile(op) 2025-05-07T20:33:03.4953627Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4953702Z 2025-05-07T20:33:03.4953801Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.4953806Z 2025-05-07T20:33:03.4953906Z moe/activation_test.py:117: 2025-05-07T20:33:03.4954044Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4954149Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.4954255Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4954764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.4954865Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.4955223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.4955454Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.4955883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.4955986Z kernel = self.compile( 2025-05-07T20:33:03.4956366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.4956545Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.4956770Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4956813Z 2025-05-07T20:33:03.4957017Z self = 2025-05-07T20:33:03.4957804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.4958348Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b908cd940>} 2025-05-07T20:33:03.4959093Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.4959291Z context = 2025-05-07T20:33:03.4959295Z 2025-05-07T20:33:03.4959468Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.4959740Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.4959853Z module_map=module_map) 2025-05-07T20:33:03.4960015Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.4960125Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.4960206Z E ^ 2025-05-07T20:33:03.4960561Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.4960573Z 2025-05-07T20:33:03.4960984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.4960988Z 2025-05-07T20:33:03.4961094Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.4961324Z self=, 2025-05-07T20:33:03.4961410Z T=128, 2025-05-07T20:33:03.4961490Z D=7168, 2025-05-07T20:33:03.4961584Z scale_ub=None, 2025-05-07T20:33:03.4961673Z contiguous=False, 2025-05-07T20:33:03.4961760Z compiled=True, 2025-05-07T20:33:03.4961848Z ) 2025-05-07T20:33:03.4962068Z self = 2025-05-07T20:33:03.4962246Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:03.4962251Z 2025-05-07T20:33:03.4962331Z @given( 2025-05-07T20:33:03.4962452Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4962564Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.4962680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.4962798Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.4962921Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.4962997Z ) 2025-05-07T20:33:03.4963258Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.4963357Z def test_silu_mul_quant( 2025-05-07T20:33:03.4963440Z self, 2025-05-07T20:33:03.4963527Z T: int, 2025-05-07T20:33:03.4963610Z D: int, 2025-05-07T20:33:03.4963711Z scale_ub: Optional[float], 2025-05-07T20:33:03.4963811Z contiguous: bool, 2025-05-07T20:33:03.4963899Z compiled: bool, 2025-05-07T20:33:03.4963980Z ) -> None: 2025-05-07T20:33:03.4964085Z torch.manual_seed(2025) 2025-05-07T20:33:03.4964160Z 2025-05-07T20:33:03.4964329Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4964414Z 2025-05-07T20:33:03.4964508Z x_sign = torch.sign(x) 2025-05-07T20:33:03.4964635Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.4964733Z x = x_sign * x_clamp 2025-05-07T20:33:03.4964816Z x0 = x[:, :D] 2025-05-07T20:33:03.4964908Z x1 = x[:, D:] 2025-05-07T20:33:03.4964988Z 2025-05-07T20:33:03.4965197Z if contiguous: 2025-05-07T20:33:03.4965301Z x0 = x0.contiguous() 2025-05-07T20:33:03.4965655Z x1 = x1.contiguous() 2025-05-07T20:33:03.4965773Z 2025-05-07T20:33:03.4966045Z if scale_ub is not None: 2025-05-07T20:33:03.4966156Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.4966295Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.4966382Z ) 2025-05-07T20:33:03.4966460Z else: 2025-05-07T20:33:03.4966556Z scale_ub_tensor = None 2025-05-07T20:33:03.4966641Z 2025-05-07T20:33:03.4966770Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4966868Z op = silu_mul_quant 2025-05-07T20:33:03.4966957Z if compiled: 2025-05-07T20:33:03.4967059Z op = torch.compile(op) 2025-05-07T20:33:03.4967172Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4967251Z 2025-05-07T20:33:03.4967353Z y_fp8, y_scale = fn() 2025-05-07T20:33:03.4967485Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:03.4967560Z 2025-05-07T20:33:03.4967697Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4967812Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:03.4967913Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:03.4968035Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:03.4968185Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.4968261Z 2025-05-07T20:33:03.4968368Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:03.4968373Z 2025-05-07T20:33:03.4968473Z moe/activation_test.py:126: 2025-05-07T20:33:03.4968604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4968716Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:03.4968850Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.4969422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:03.4969548Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:03.4969932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.4970159Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.4970523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:03.4970780Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:03.4971158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:03.4971327Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:03.4971683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:03.4971761Z fn() 2025-05-07T20:33:03.4972159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:03.4972253Z self.fn.run( 2025-05-07T20:33:03.4972588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.4972682Z kernel = self.compile( 2025-05-07T20:33:03.4973069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.4973243Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.4973383Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4973387Z 2025-05-07T20:33:03.4973594Z self = 2025-05-07T20:33:03.4974599Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.4975151Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b90281620>} 2025-05-07T20:33:03.4975894Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.4976093Z context = 2025-05-07T20:33:03.4976098Z 2025-05-07T20:33:03.4976263Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.4976542Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.4976652Z module_map=module_map) 2025-05-07T20:33:03.4976814Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.4976935Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:03.4977015Z E ^ 2025-05-07T20:33:03.4977371Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
E   triton.compiler.errors.CompilationError compiling _fbgemm_silu_mul_quant (moe/activation_test.py:115 in fn -> silu_mul_quant, moe/activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
E   triton.compiler.errors.CompilationError compiling _fbgemm_silu_mul_quant (moe/activation_test.py:115 in fn -> silu_mul_quant, moe/activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
E   triton.compiler.errors.CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:124 in ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
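All of these failures share one root cause: Triton's fp8e4nv type is CUDA float8 e4m3, and its conversion instructions are, as far as I know, only emitted for NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). This job runs on linux.g5.4xlarge, whose A10G GPU reports SM 8.6, where Triton only exposes fp8e4b15 and fp8e5 (e5m2). A minimal sketch of a skip guard under that assumption (the helper name and its placement are hypothetical, not taken from moe/activation_test.py):

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv (float8 e4m3) codegen requires SM >= 8.9.
        # The A10G on a g5.4xlarge reports (8, 6), so it would be skipped.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    class Fp8ActivationTests(unittest.TestCase):
        ...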
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
E   triton.compiler.errors.CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:124 in ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
E   triton.compiler.errors.CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:124 in ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
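The reference path that fails here, triton_quantize_fp8_row, only needs to produce a (y_fp8, y_scale) pair satisfying y ~= y_fp8.to(torch.float32) * y_scale[:, None], which is exactly how the test dequantizes. A rough pure-PyTorch sketch of that row-wise contract (the real kernel's scale_ub handling and clamping details may differ):

    import torch

    def rowwise_fp8_quant_sketch(y, scale_ub=None):
        # Scale each row so its max magnitude maps onto the fp8 e4m3 max (448).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            # Assumed semantics: scale_ub caps the per-row max used for scaling.
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = (row_max / fp8_max).clamp(min=1e-12)
        y_fp8 = (y.float() / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale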
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
E   triton.compiler.errors.CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:124 in ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
E   triton.compiler.errors.CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:124 in ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
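Hypothesis is incidental; any single call should reproduce the error on this runner. A sketch, assuming silu_mul_quant can be imported from the module path shown in the tracebacks:

    import torch

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]
    # On SM < 8.9 this raises triton.compiler.errors.CompilationError:
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)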
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
E   triton.compiler.errors.CompilationError compiling _fbgemm_silu_mul_quant (moe/activation_test.py:115 in fn -> torch/_dynamo/eval_frame.py:678 -> silu_mul_quant, moe/activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
E   triton.compiler.errors.CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:124 in ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
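Note that both entry points target fp8e4nv: the op under test compiles _fbgemm_silu_mul_quant and the test's own reference compiles _kernel_quantize_fp8_row, so every example fails no matter which side runs first. If a fallback were preferred over skipping, an illustrative dtype pick (not fbgemm_gpu's actual behavior) could trade e4m3's extra mantissa bit for e5m2, which Triton does support on SM 8.6:

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # Illustration only: use e4m3 (Triton fp8e4nv) on SM >= 8.9,
        # otherwise fall back to e5m2 (Triton fp8e5).
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2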
y_scale_ref = ref_fn() 2025-05-07T20:33:03.5111273Z 2025-05-07T20:33:03.5111372Z moe/activation_test.py:126: 2025-05-07T20:33:03.5111633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5111748Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:03.5111883Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.5112484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:03.5112586Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:03.5112943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5113170Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5113536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:03.5113802Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:03.5114177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:03.5114343Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:03.5114689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:03.5114768Z fn() 2025-05-07T20:33:03.5115165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:03.5115254Z self.fn.run( 2025-05-07T20:33:03.5115590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5115690Z kernel = self.compile( 2025-05-07T20:33:03.5116142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5116323Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5116462Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5116466Z 2025-05-07T20:33:03.5116669Z self = 2025-05-07T20:33:03.5117455Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5117963Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66bbb2e0>} 2025-05-07T20:33:03.5118706Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5118908Z context = 2025-05-07T20:33:03.5118912Z 2025-05-07T20:33:03.5119078Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5119353Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5119486Z module_map=module_map) 2025-05-07T20:33:03.5119672Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5119783Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:03.5119862Z E ^ 2025-05-07T20:33:03.5120217Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5120228Z 2025-05-07T20:33:03.5120638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5120643Z 2025-05-07T20:33:03.5120747Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5121101Z self=, 2025-05-07T20:33:03.5121184Z T=1, 2025-05-07T20:33:03.5121262Z D=5120, 2025-05-07T20:33:03.5121351Z scale_ub=None, 2025-05-07T20:33:03.5121478Z contiguous=True, 2025-05-07T20:33:03.5121563Z compiled=False, 2025-05-07T20:33:03.5121646Z ) 2025-05-07T20:33:03.5121867Z self = 2025-05-07T20:33:03.5122036Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:03.5122041Z 2025-05-07T20:33:03.5122119Z @given( 2025-05-07T20:33:03.5122240Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5122345Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5122466Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5122584Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5122707Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5122791Z ) 2025-05-07T20:33:03.5123043Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5123137Z def test_silu_mul_quant( 2025-05-07T20:33:03.5123218Z self, 2025-05-07T20:33:03.5123303Z T: int, 2025-05-07T20:33:03.5123385Z D: int, 2025-05-07T20:33:03.5123486Z scale_ub: Optional[float], 2025-05-07T20:33:03.5123585Z contiguous: bool, 2025-05-07T20:33:03.5123671Z compiled: bool, 2025-05-07T20:33:03.5123750Z ) -> None: 2025-05-07T20:33:03.5123853Z torch.manual_seed(2025) 2025-05-07T20:33:03.5123928Z 2025-05-07T20:33:03.5124098Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5124181Z 2025-05-07T20:33:03.5124277Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5124402Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5124499Z x = x_sign * x_clamp 2025-05-07T20:33:03.5124589Z x0 = x[:, :D] 2025-05-07T20:33:03.5124677Z x1 = x[:, D:] 2025-05-07T20:33:03.5124754Z 2025-05-07T20:33:03.5124838Z if contiguous: 2025-05-07T20:33:03.5124938Z x0 = x0.contiguous() 2025-05-07T20:33:03.5125032Z x1 = x1.contiguous() 2025-05-07T20:33:03.5125107Z 2025-05-07T20:33:03.5125206Z if scale_ub is not None: 2025-05-07T20:33:03.5125313Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5125448Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5125537Z ) 2025-05-07T20:33:03.5125614Z else: 2025-05-07T20:33:03.5125711Z scale_ub_tensor = None 2025-05-07T20:33:03.5125797Z 2025-05-07T20:33:03.5125928Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5126025Z op = silu_mul_quant 2025-05-07T20:33:03.5126113Z if compiled: 2025-05-07T20:33:03.5126215Z op = torch.compile(op) 2025-05-07T20:33:03.5126339Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5126414Z 2025-05-07T20:33:03.5126506Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5126510Z 2025-05-07T20:33:03.5126615Z moe/activation_test.py:117: 2025-05-07T20:33:03.5126751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5126854Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5126964Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5127462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5127565Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5127921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5128145Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5128575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5128733Z kernel = self.compile( 2025-05-07T20:33:03.5129115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5129339Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5129480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5129486Z 2025-05-07T20:33:03.5129729Z self = 2025-05-07T20:33:03.5130508Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5131028Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66bbbc40>} 2025-05-07T20:33:03.5131772Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5131965Z context = 2025-05-07T20:33:03.5131970Z 2025-05-07T20:33:03.5132141Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5132406Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5132520Z module_map=module_map) 2025-05-07T20:33:03.5132682Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5132781Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5132866Z E ^ 2025-05-07T20:33:03.5133230Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5133237Z 2025-05-07T20:33:03.5133649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5133663Z 2025-05-07T20:33:03.5133767Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5133991Z self=, 2025-05-07T20:33:03.5134078Z T=128, 2025-05-07T20:33:03.5134157Z D=5120, 2025-05-07T20:33:03.5134244Z scale_ub=None, 2025-05-07T20:33:03.5134338Z contiguous=False, 2025-05-07T20:33:03.5134423Z compiled=True, 2025-05-07T20:33:03.5134498Z ) 2025-05-07T20:33:03.5134721Z self = 2025-05-07T20:33:03.5134892Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:03.5134896Z 2025-05-07T20:33:03.5134976Z @given( 2025-05-07T20:33:03.5135108Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5135211Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5135334Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5135453Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5135568Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5135650Z ) 2025-05-07T20:33:03.5135895Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5135989Z def test_silu_mul_quant( 2025-05-07T20:33:03.5136073Z self, 2025-05-07T20:33:03.5136151Z T: int, 2025-05-07T20:33:03.5136228Z D: int, 2025-05-07T20:33:03.5136337Z scale_ub: Optional[float], 2025-05-07T20:33:03.5136428Z contiguous: bool, 2025-05-07T20:33:03.5136521Z compiled: bool, 2025-05-07T20:33:03.5136600Z ) -> None: 2025-05-07T20:33:03.5136698Z torch.manual_seed(2025) 2025-05-07T20:33:03.5136782Z 2025-05-07T20:33:03.5137077Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5137153Z 2025-05-07T20:33:03.5137254Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5137380Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5137512Z x = x_sign * x_clamp 2025-05-07T20:33:03.5137602Z x0 = x[:, :D] 2025-05-07T20:33:03.5137684Z x1 = x[:, D:] 2025-05-07T20:33:03.5137759Z 2025-05-07T20:33:03.5137850Z if contiguous: 2025-05-07T20:33:03.5137943Z x0 = x0.contiguous() 2025-05-07T20:33:03.5138034Z x1 = x1.contiguous() 2025-05-07T20:33:03.5138115Z 2025-05-07T20:33:03.5138207Z if scale_ub is not None: 2025-05-07T20:33:03.5138319Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5138454Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5138532Z ) 2025-05-07T20:33:03.5138617Z else: 2025-05-07T20:33:03.5138720Z scale_ub_tensor = None 2025-05-07T20:33:03.5138799Z 2025-05-07T20:33:03.5138936Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5139029Z op = silu_mul_quant 2025-05-07T20:33:03.5139120Z if compiled: 2025-05-07T20:33:03.5139232Z op = torch.compile(op) 2025-05-07T20:33:03.5139339Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5139418Z 2025-05-07T20:33:03.5139545Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5139550Z 2025-05-07T20:33:03.5139661Z moe/activation_test.py:117: 2025-05-07T20:33:03.5139809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5139911Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5140018Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5140399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.5140495Z return fn(*args, **kwargs) 
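Note on the failure mode repeated in these traces: Triton's fp8e4nv is the NVIDIA variant of the OCP float8 e4m3 type (torch.float8_e4m3fn), and Triton's NVIDIA backend accepts it only on GPUs with compute capability 8.9 or newer (Ada/Hopper). This job runs on linux.g5.4xlarge.nvidia.gpu, i.e. an A10G at compute capability 8.6, so every kernel that touches fp8e4nv is rejected when the kernel is first compiled; only fp8e4b15 and fp8e5 are accepted on this architecture. A minimal detection sketch (not part of the test file; the function name is illustrative):

import torch

def cuda_supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs sm_89+ (Ada/Hopper); the A10G here is sm_86.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)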
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
_fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): same CompilationError from _fbgemm_silu_mul_quant (fp8e4nv not supported in this architecture)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False): same CompilationError from _fbgemm_silu_mul_quant (fp8e4nv not supported in this architecture)

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
_fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5186198Z 2025-05-07T20:33:03.5186609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5186614Z 2025-05-07T20:33:03.5186723Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5186946Z self=, 2025-05-07T20:33:03.5187153Z T=1, 2025-05-07T20:33:03.5187239Z D=7168, 2025-05-07T20:33:03.5187323Z scale_ub=1200.0, 2025-05-07T20:33:03.5187409Z contiguous=True, 2025-05-07T20:33:03.5187498Z compiled=True, 2025-05-07T20:33:03.5187615Z ) 2025-05-07T20:33:03.5187840Z self = 2025-05-07T20:33:03.5188005Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:03.5188009Z 2025-05-07T20:33:03.5188086Z @given( 2025-05-07T20:33:03.5188211Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5188310Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5188427Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5188550Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5188664Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5188740Z ) 2025-05-07T20:33:03.5188999Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5189095Z def test_silu_mul_quant( 2025-05-07T20:33:03.5189178Z self, 2025-05-07T20:33:03.5189257Z T: int, 2025-05-07T20:33:03.5189338Z D: int, 2025-05-07T20:33:03.5189446Z scale_ub: Optional[float], 2025-05-07T20:33:03.5189545Z contiguous: bool, 2025-05-07T20:33:03.5189651Z compiled: bool, 2025-05-07T20:33:03.5189748Z ) -> None: 2025-05-07T20:33:03.5189861Z torch.manual_seed(2025) 2025-05-07T20:33:03.5189935Z 2025-05-07T20:33:03.5190110Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5190188Z 2025-05-07T20:33:03.5190280Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5190422Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5190512Z x = x_sign * x_clamp 2025-05-07T20:33:03.5190594Z x0 = x[:, :D] 2025-05-07T20:33:03.5196600Z x1 = x[:, D:] 2025-05-07T20:33:03.5196696Z 2025-05-07T20:33:03.5196808Z if contiguous: 2025-05-07T20:33:03.5196903Z x0 = x0.contiguous() 2025-05-07T20:33:03.5196995Z x1 = x1.contiguous() 2025-05-07T20:33:03.5197078Z 2025-05-07T20:33:03.5197174Z if scale_ub is not None: 2025-05-07T20:33:03.5197284Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5197432Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5197510Z ) 2025-05-07T20:33:03.5197596Z else: 2025-05-07T20:33:03.5197692Z scale_ub_tensor = None 2025-05-07T20:33:03.5197767Z 2025-05-07T20:33:03.5197907Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5197998Z op = silu_mul_quant 2025-05-07T20:33:03.5198086Z if compiled: 2025-05-07T20:33:03.5198193Z op = torch.compile(op) 2025-05-07T20:33:03.5198298Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5198372Z 2025-05-07T20:33:03.5198483Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5198490Z 2025-05-07T20:33:03.5198589Z moe/activation_test.py:117: 2025-05-07T20:33:03.5198720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5198831Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5198932Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5199315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.5199411Z return fn(*args, **kwargs) 
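The compiled=True examples, like the one above, enter through torch/_dynamo/eval_frame.py before reaching the same launch site: silu_mul_quant invokes the Triton JIT kernel _fbgemm_silu_mul_quant[grid](...), and Triton compiles the kernel at first launch (jit.py run -> self.compile -> make_ir), so torch.compile does not change the outcome. Any kernel that materializes the type reproduces the error on this GPU; a hypothetical repro sketch (this kernel is not FBGEMM's):

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_to_fp8e4nv(X, Y, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(X + offs, mask=mask)
    # on sm < 8.9 this cast is what trips ValueError("type fp8e4nv not supported ...")
    tl.store(Y + offs, x.to(tl.float8e4nv), mask=mask)

x = torch.randn(1024, device="cuda")
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
_cast_to_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)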
2025-05-07T20:33:03.5199952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5200059Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5200414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5200634Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5201181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5201287Z kernel = self.compile( 2025-05-07T20:33:03.5201710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5201886Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5202023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5202028Z 2025-05-07T20:33:03.5202235Z self = 2025-05-07T20:33:03.5203020Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5203530Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69ef4360>} 2025-05-07T20:33:03.5204282Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5204476Z context = 2025-05-07T20:33:03.5204480Z 2025-05-07T20:33:03.5204644Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5204916Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5205027Z module_map=module_map) 2025-05-07T20:33:03.5205188Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5205297Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5205379Z E ^ 2025-05-07T20:33:03.5205749Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5205754Z 2025-05-07T20:33:03.5206167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5206175Z 2025-05-07T20:33:03.5206279Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5206512Z self=, 2025-05-07T20:33:03.5206592Z T=1, 2025-05-07T20:33:03.5206671Z D=7168, 2025-05-07T20:33:03.5206763Z scale_ub=1200.0, 2025-05-07T20:33:03.5206851Z contiguous=False, 2025-05-07T20:33:03.5206942Z compiled=True, 2025-05-07T20:33:03.5207019Z ) 2025-05-07T20:33:03.5207237Z self = 2025-05-07T20:33:03.5207408Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:03.5207413Z 2025-05-07T20:33:03.5207498Z @given( 2025-05-07T20:33:03.5207620Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5207729Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5207845Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5207966Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5208087Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5208165Z ) 2025-05-07T20:33:03.5208416Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5208512Z def test_silu_mul_quant( 2025-05-07T20:33:03.5208591Z self, 2025-05-07T20:33:03.5208677Z T: int, 2025-05-07T20:33:03.5208756Z D: int, 2025-05-07T20:33:03.5208858Z scale_ub: Optional[float], 2025-05-07T20:33:03.5208957Z contiguous: bool, 2025-05-07T20:33:03.5209045Z compiled: bool, 2025-05-07T20:33:03.5209125Z ) -> None: 2025-05-07T20:33:03.5209314Z torch.manual_seed(2025) 2025-05-07T20:33:03.5209427Z 2025-05-07T20:33:03.5209597Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5209679Z 2025-05-07T20:33:03.5209773Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5209948Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5210038Z x = x_sign * x_clamp 2025-05-07T20:33:03.5210120Z x0 = x[:, :D] 2025-05-07T20:33:03.5210209Z x1 = x[:, D:] 2025-05-07T20:33:03.5210283Z 2025-05-07T20:33:03.5210369Z if contiguous: 2025-05-07T20:33:03.5210470Z x0 = x0.contiguous() 2025-05-07T20:33:03.5210561Z x1 = x1.contiguous() 2025-05-07T20:33:03.5210635Z 2025-05-07T20:33:03.5210735Z if scale_ub is not None: 2025-05-07T20:33:03.5210842Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5210977Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5211064Z ) 2025-05-07T20:33:03.5211149Z else: 2025-05-07T20:33:03.5211253Z scale_ub_tensor = None 2025-05-07T20:33:03.5211329Z 2025-05-07T20:33:03.5211459Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5211561Z op = silu_mul_quant 2025-05-07T20:33:03.5211648Z if compiled: 2025-05-07T20:33:03.5211750Z op = torch.compile(op) 2025-05-07T20:33:03.5211862Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5211936Z 2025-05-07T20:33:03.5212029Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5212034Z 2025-05-07T20:33:03.5212139Z moe/activation_test.py:117: 2025-05-07T20:33:03.5212269Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5212377Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5212477Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5212846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.5212954Z return fn(*args, **kwargs) 
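Not every example dies inside silu_mul_quant itself: the next example below (T=1, D=7168, scale_ub=None) gets past fn() and instead fails in ref_fn, because triton_quantize_fp8_row is backed by another fp8e4nv Triton kernel, _kernel_quantize_fp8_row, and that one is autotuned: the error surfaces from autotuner.py (_bench / do_bench), which compiles each candidate config before timing it. For reference, row-wise fp8 quantization of the kind being tested can be sketched in eager PyTorch as follows; the scale convention matches the test's dequantization y_fp8.to(torch.float32) * y_scale[:, None], but the constants and clamping details are assumptions, not FBGEMM's exact kernel:

import torch

FP8_E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn

def quantize_fp8_row_sketch(y, scale_ub=None):
    row_max = y.abs().amax(dim=-1)                  # per-row max magnitude
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # optional upper bound on the scale
    y_scale = torch.clamp(row_max, min=1e-12) / FP8_E4M3_MAX
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale   # dequantize as y_fp8.float() * y_scale[:, None]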
2025-05-07T20:33:03.5213446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5213547Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5213908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5214129Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5214473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5214566Z kernel = self.compile( 2025-05-07T20:33:03.5214945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5215125Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5215256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5215263Z 2025-05-07T20:33:03.5215468Z self = 2025-05-07T20:33:03.5216253Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5216756Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69ef59e0>} 2025-05-07T20:33:03.5217505Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5217698Z context = 2025-05-07T20:33:03.5217828Z 2025-05-07T20:33:03.5218002Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5218266Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5218422Z module_map=module_map) 2025-05-07T20:33:03.5218593Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5218696Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5218775Z E ^ 2025-05-07T20:33:03.5219139Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5219144Z 2025-05-07T20:33:03.5219581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5219585Z 2025-05-07T20:33:03.5219719Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5219948Z self=, 2025-05-07T20:33:03.5220029Z T=1, 2025-05-07T20:33:03.5220117Z D=7168, 2025-05-07T20:33:03.5220204Z scale_ub=None, 2025-05-07T20:33:03.5220294Z contiguous=False, 2025-05-07T20:33:03.5220390Z compiled=True, 2025-05-07T20:33:03.5220467Z ) 2025-05-07T20:33:03.5220697Z self = 2025-05-07T20:33:03.5220861Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:03.5220865Z 2025-05-07T20:33:03.5220947Z @given( 2025-05-07T20:33:03.5221078Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5221178Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5221295Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5221422Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5221537Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5221614Z ) 2025-05-07T20:33:03.5221873Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5221968Z def test_silu_mul_quant( 2025-05-07T20:33:03.5222054Z self, 2025-05-07T20:33:03.5222135Z T: int, 2025-05-07T20:33:03.5222214Z D: int, 2025-05-07T20:33:03.5222324Z scale_ub: Optional[float], 2025-05-07T20:33:03.5222415Z contiguous: bool, 2025-05-07T20:33:03.5222502Z compiled: bool, 2025-05-07T20:33:03.5222590Z ) -> None: 2025-05-07T20:33:03.5222689Z torch.manual_seed(2025) 2025-05-07T20:33:03.5222765Z 2025-05-07T20:33:03.5222941Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5223017Z 2025-05-07T20:33:03.5223113Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5223247Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5223337Z x = x_sign * x_clamp 2025-05-07T20:33:03.5223430Z x0 = x[:, :D] 2025-05-07T20:33:03.5223512Z x1 = x[:, D:] 2025-05-07T20:33:03.5223593Z 2025-05-07T20:33:03.5223688Z if contiguous: 2025-05-07T20:33:03.5223782Z x0 = x0.contiguous() 2025-05-07T20:33:03.5223875Z x1 = x1.contiguous() 2025-05-07T20:33:03.5223962Z 2025-05-07T20:33:03.5224054Z if scale_ub is not None: 2025-05-07T20:33:03.5224162Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5224306Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5224386Z ) 2025-05-07T20:33:03.5224463Z else: 2025-05-07T20:33:03.5224566Z scale_ub_tensor = None 2025-05-07T20:33:03.5224641Z 2025-05-07T20:33:03.5224778Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5224870Z op = silu_mul_quant 2025-05-07T20:33:03.5224956Z if compiled: 2025-05-07T20:33:03.5225064Z op = torch.compile(op) 2025-05-07T20:33:03.5225172Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5225936Z 2025-05-07T20:33:03.5226042Z y_fp8, y_scale = fn() 2025-05-07T20:33:03.5226164Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:03.5226240Z 2025-05-07T20:33:03.5226450Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5226553Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:03.5226654Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:03.5226783Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:03.5226923Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.5227007Z 2025-05-07T20:33:03.5227110Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:03.5227115Z 2025-05-07T20:33:03.5227213Z moe/activation_test.py:126: 2025-05-07T20:33:03.5227354Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5227459Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:03.5227604Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.5228171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:03.5228277Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:03.5228641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5228862Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5229226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:03.5229499Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:03.5229921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:03.5230098Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:03.5230440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:03.5230520Z fn() 2025-05-07T20:33:03.5230928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:03.5231015Z self.fn.run( 2025-05-07T20:33:03.5231350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5231450Z kernel = self.compile( 2025-05-07T20:33:03.5231830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5232013Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5232142Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5232147Z 2025-05-07T20:33:03.5232357Z self = 2025-05-07T20:33:03.5233142Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5233648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69ef6700>} 2025-05-07T20:33:03.5234397Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5234587Z context = 2025-05-07T20:33:03.5234592Z 2025-05-07T20:33:03.5234755Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5235115Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5235268Z module_map=module_map) 2025-05-07T20:33:03.5235439Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5235583Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:03.5235664Z E ^ 2025-05-07T20:33:03.5236093Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5236098Z 2025-05-07T20:33:03.5236513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5236518Z 2025-05-07T20:33:03.5236633Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5236857Z self=, 2025-05-07T20:33:03.5236938Z T=1, 2025-05-07T20:33:03.5237027Z D=5120, 2025-05-07T20:33:03.5237112Z scale_ub=1200.0, 2025-05-07T20:33:03.5237209Z contiguous=False, 2025-05-07T20:33:03.5237301Z compiled=True, 2025-05-07T20:33:03.5237377Z ) 2025-05-07T20:33:03.5237596Z self = 2025-05-07T20:33:03.5237773Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:03.5237778Z 2025-05-07T20:33:03.5237855Z @given( 2025-05-07T20:33:03.5237984Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5238085Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5238203Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5238331Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5238446Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5238524Z ) 2025-05-07T20:33:03.5238775Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5238869Z def test_silu_mul_quant( 2025-05-07T20:33:03.5238954Z self, 2025-05-07T20:33:03.5239043Z T: int, 2025-05-07T20:33:03.5239123Z D: int, 2025-05-07T20:33:03.5239222Z scale_ub: Optional[float], 2025-05-07T20:33:03.5239322Z contiguous: bool, 2025-05-07T20:33:03.5239424Z compiled: bool, 2025-05-07T20:33:03.5239524Z ) -> None: 2025-05-07T20:33:03.5239637Z torch.manual_seed(2025) 2025-05-07T20:33:03.5239718Z 2025-05-07T20:33:03.5239899Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5239978Z 2025-05-07T20:33:03.5240070Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5240208Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5240302Z x = x_sign * x_clamp 2025-05-07T20:33:03.5240387Z x0 = x[:, :D] 2025-05-07T20:33:03.5240473Z x1 = x[:, D:] 2025-05-07T20:33:03.5240554Z 2025-05-07T20:33:03.5240640Z if contiguous: 2025-05-07T20:33:03.5240734Z x0 = x0.contiguous() 2025-05-07T20:33:03.5240839Z x1 = x1.contiguous() 2025-05-07T20:33:03.5240912Z 2025-05-07T20:33:03.5241003Z if scale_ub is not None: 2025-05-07T20:33:03.5241118Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5241253Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5241337Z ) 2025-05-07T20:33:03.5241415Z else: 2025-05-07T20:33:03.5241510Z scale_ub_tensor = None 2025-05-07T20:33:03.5241592Z 2025-05-07T20:33:03.5241726Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5241819Z op = silu_mul_quant 2025-05-07T20:33:03.5241914Z if compiled: 2025-05-07T20:33:03.5242014Z op = torch.compile(op) 2025-05-07T20:33:03.5242120Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5242199Z 2025-05-07T20:33:03.5242290Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5242294Z 2025-05-07T20:33:03.5242398Z moe/activation_test.py:117: 2025-05-07T20:33:03.5242660Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5242762Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5242872Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5243280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.5243374Z return fn(*args, **kwargs) 
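One reading note on the reference implementation that keeps appearing in these listings: ref_fn computes x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32, which is exactly SiLU(x0) * x1 in full precision before the row-wise fp8 quantization step. A quick CPU-only equivalence check (illustrative only):

import torch
import torch.nn.functional as F

x0 = torch.randn(4, 8)
x1 = torch.randn(4, 8)
assert torch.allclose(x0 * torch.sigmoid(x0) * x1, F.silu(x0) * x1)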
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f8a69ef7e20>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f8b66666480>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
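[Editor's note on the failure above: Triton lowers the fp8e4nv (FP8 E4M3) dtype only on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper); on older architectures such as SM 8.6 (A10G) the compiler rejects it with exactly the ValueError quoted here, offering only fp8e4b15 and fp8e5. A minimal sketch of one possible guard follows; supports_fp8e4nv and requires_fp8 are hypothetical helper names, not part of FBGEMM or of activation_test.py.]

# Hedged sketch (assumption, not FBGEMM code): skip fp8e4nv tests on pre-SM-8.9 GPUs.
import pytest
import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (E4M3) lowering requires compute capability >= 8.9 (Ada, Hopper);
    # SM 8.6 and older GPUs raise the ValueError recorded in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical marker; applied to test_silu_mul_quant it would skip the
# Hypothesis examples before they ever reach Triton compilation.
requires_fp8 = pytest.mark.skipif(
    not supports_fp8e4nv(),
    reason="Triton fp8e4nv requires NVIDIA compute capability >= 8.9",
)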
[Hypothesis went on to try ten more parameter combinations; each failed at the same point, with the identical CompilationError raised from triton/compiler/compiler.py:100. The repeated test source and tracebacks are elided; only the tried parameters differ:]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
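[Editor's note: to isolate the problem outside the test suite, a minimal Triton kernel that merely touches an fp8e4nv value reproduces the same compile-time failure on unsupported GPUs. The kernel below is an illustrative stand-in under that assumption, not the real _fbgemm_silu_mul_quant.]

# Hedged repro sketch: casting/storing tl.float8e4nv fails at compile time on SM < 8.9.
import torch
import triton
import triton.language as tl

@triton.jit
def _cast_fp8e4nv(x_ptr, y_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    x = tl.load(x_ptr + offs)
    # On pre-SM-8.9 GPUs, compilation aborts with:
    #   ValueError("type fp8e4nv not supported in this architecture. ...")
    tl.store(y_ptr + offs, x.to(tl.float8e4nv))

x = torch.randn(128, device="cuda", dtype=torch.float32)
y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
_cast_fp8e4nv[(1,)](x, y, BLOCK=128)  # raises triton.compiler.errors.CompilationError here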
2025-05-07T20:33:03.5402546Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:03.5415332Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:03.5415897Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:03.5428125Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:03.5428656Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:03.5441604Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
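One readable difference between the repeated tracebacks: examples with compiled=True enter through torch/_dynamo/eval_frame.py (the extra "in _fn" frame) before reaching the Triton launch, while compiled=False calls the op directly; the terminal CompilationError is identical either way. A small sketch of that wrapping, with a stand-in body rather than the real kernel:

import torch

def op(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for silu_mul_quant; only the call path is of interest.
    return x * torch.sigmoid(x)

compiled_op = torch.compile(op)
# Calls to compiled_op route through Dynamo's eval_frame._fn wrapper,
# which is why compiled=True tracebacks carry one extra frame before
# triton/runtime/jit.py; the underlying kernel compile is unchanged.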
2025-05-07T20:33:03.5442141Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:03.5461176Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:03.5461717Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:03.5475171Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:03.5475709Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:03.5488786Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
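Hypothesis keeps drawing fresh examples because every draw fails before any shape-dependent work runs: the Triton kernel is compiled up front, so T, D, scale_ub, contiguous, and compiled never get a chance to matter. The strategies in the test above sample from a fixed grid; a quick sketch of that space (the names here are local to the sketch):

from itertools import product

Ts = [1, 128, 2048, 4096, 16384]
Ds = [5120, 7168]
scale_ubs = [None, 1200.00]
flags = [True, False]

# st.sampled_from draws from fixed pools, so the full search space is
# their Cartesian product: 5 * 2 * 2 * 2 * 2 = 80 combinations.
combos = list(product(Ts, Ds, scale_ubs, flags, flags))
assert len(combos) == 80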
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5488791Z 2025-05-07T20:33:03.5489201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5489206Z 2025-05-07T20:33:03.5489320Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5489544Z self=, 2025-05-07T20:33:03.5489623Z T=16384, 2025-05-07T20:33:03.5489709Z D=5120, 2025-05-07T20:33:03.5489801Z scale_ub=None, 2025-05-07T20:33:03.5489890Z contiguous=False, 2025-05-07T20:33:03.5489982Z compiled=True, 2025-05-07T20:33:03.5490057Z ) 2025-05-07T20:33:03.5490283Z self = 2025-05-07T20:33:03.5490463Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:03.5490468Z 2025-05-07T20:33:03.5490546Z @given( 2025-05-07T20:33:03.5490672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5490774Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5490890Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5491022Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5491141Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5491219Z ) 2025-05-07T20:33:03.5491472Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5491576Z def test_silu_mul_quant( 2025-05-07T20:33:03.5491664Z self, 2025-05-07T20:33:03.5491745Z T: int, 2025-05-07T20:33:03.5491825Z D: int, 2025-05-07T20:33:03.5491935Z scale_ub: Optional[float], 2025-05-07T20:33:03.5492032Z contiguous: bool, 2025-05-07T20:33:03.5492120Z compiled: bool, 2025-05-07T20:33:03.5492209Z ) -> None: 2025-05-07T20:33:03.5492305Z torch.manual_seed(2025) 2025-05-07T20:33:03.5492384Z 2025-05-07T20:33:03.5492560Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5492637Z 2025-05-07T20:33:03.5492731Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5492865Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5492956Z x = x_sign * x_clamp 2025-05-07T20:33:03.5493047Z x0 = x[:, :D] 2025-05-07T20:33:03.5493133Z x1 = x[:, D:] 2025-05-07T20:33:03.5493209Z 2025-05-07T20:33:03.5493304Z if contiguous: 2025-05-07T20:33:03.5493404Z x0 = x0.contiguous() 2025-05-07T20:33:03.5493496Z x1 = x1.contiguous() 2025-05-07T20:33:03.5493582Z 2025-05-07T20:33:03.5493674Z if scale_ub is not None: 2025-05-07T20:33:03.5493787Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5493932Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5494010Z ) 2025-05-07T20:33:03.5494092Z else: 2025-05-07T20:33:03.5494197Z scale_ub_tensor = None 2025-05-07T20:33:03.5494273Z 2025-05-07T20:33:03.5494404Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5494503Z op = silu_mul_quant 2025-05-07T20:33:03.5494590Z if compiled: 2025-05-07T20:33:03.5494700Z op = torch.compile(op) 2025-05-07T20:33:03.5494807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5494885Z 2025-05-07T20:33:03.5494992Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5494996Z 2025-05-07T20:33:03.5495226Z moe/activation_test.py:117: 2025-05-07T20:33:03.5495359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5495467Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5495626Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5495994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.5496098Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.5496587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5496685Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5497047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5497268Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5497612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5497717Z kernel = self.compile( 2025-05-07T20:33:03.5498097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5498284Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5498411Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5498416Z 2025-05-07T20:33:03.5498622Z self = 2025-05-07T20:33:03.5499404Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5499910Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69accb80>} 2025-05-07T20:33:03.5500658Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5500851Z context = 2025-05-07T20:33:03.5500855Z 2025-05-07T20:33:03.5501017Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5501285Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5501392Z module_map=module_map) 2025-05-07T20:33:03.5501562Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5501662Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5501741Z E ^ 2025-05-07T20:33:03.5502104Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5502111Z 2025-05-07T20:33:03.5502522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5502528Z 2025-05-07T20:33:03.5502639Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5502862Z self=, 2025-05-07T20:33:03.5502941Z T=2048, 2025-05-07T20:33:03.5503024Z D=5120, 2025-05-07T20:33:03.5503108Z scale_ub=None, 2025-05-07T20:33:03.5503195Z contiguous=False, 2025-05-07T20:33:03.5503287Z compiled=True, 2025-05-07T20:33:03.5503362Z ) 2025-05-07T20:33:03.5503579Z self = 2025-05-07T20:33:03.5503759Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:03.5503763Z 2025-05-07T20:33:03.5503840Z @given( 2025-05-07T20:33:03.5504089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5504191Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5504308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5504471Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5504586Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5504660Z ) 2025-05-07T20:33:03.5504911Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5505006Z def test_silu_mul_quant( 2025-05-07T20:33:03.5505084Z self, 2025-05-07T20:33:03.5505170Z T: int, 2025-05-07T20:33:03.5505247Z D: int, 2025-05-07T20:33:03.5505351Z scale_ub: Optional[float], 2025-05-07T20:33:03.5505443Z contiguous: bool, 2025-05-07T20:33:03.5505530Z compiled: bool, 2025-05-07T20:33:03.5505615Z ) -> None: 2025-05-07T20:33:03.5505714Z torch.manual_seed(2025) 2025-05-07T20:33:03.5505787Z 2025-05-07T20:33:03.5505968Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5506044Z 2025-05-07T20:33:03.5506136Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5506270Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5506361Z x = x_sign * x_clamp 2025-05-07T20:33:03.5506443Z x0 = x[:, :D] 2025-05-07T20:33:03.5506531Z x1 = x[:, D:] 2025-05-07T20:33:03.5506606Z 2025-05-07T20:33:03.5506690Z if contiguous: 2025-05-07T20:33:03.5506789Z x0 = x0.contiguous() 2025-05-07T20:33:03.5506880Z x1 = x1.contiguous() 2025-05-07T20:33:03.5506959Z 2025-05-07T20:33:03.5507050Z if scale_ub is not None: 2025-05-07T20:33:03.5507157Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5507297Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5507374Z ) 2025-05-07T20:33:03.5507452Z else: 2025-05-07T20:33:03.5507561Z scale_ub_tensor = None 2025-05-07T20:33:03.5507634Z 2025-05-07T20:33:03.5507766Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5507864Z op = silu_mul_quant 2025-05-07T20:33:03.5507953Z if compiled: 2025-05-07T20:33:03.5508053Z op = torch.compile(op) 2025-05-07T20:33:03.5508165Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5508239Z 2025-05-07T20:33:03.5508338Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5508342Z 2025-05-07T20:33:03.5508443Z moe/activation_test.py:117: 2025-05-07T20:33:03.5508574Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5508682Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5508783Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5509149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.5509256Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.5509801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5509906Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5510264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5510487Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5510834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5510929Z kernel = self.compile( 2025-05-07T20:33:03.5511306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5511492Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5511704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5511746Z 2025-05-07T20:33:03.5511957Z self = 2025-05-07T20:33:03.5512732Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5513281Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69ace0c0>} 2025-05-07T20:33:03.5514023Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5514214Z context = 2025-05-07T20:33:03.5514219Z 2025-05-07T20:33:03.5514396Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5514658Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5514781Z module_map=module_map) 2025-05-07T20:33:03.5514944Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5515046Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5515132Z E ^ 2025-05-07T20:33:03.5515485Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5515490Z 2025-05-07T20:33:03.5516050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5516062Z 2025-05-07T20:33:03.5516169Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5516394Z self=, 2025-05-07T20:33:03.5516489Z T=2048, 2025-05-07T20:33:03.5516568Z D=5120, 2025-05-07T20:33:03.5516654Z scale_ub=1200.0, 2025-05-07T20:33:03.5516751Z contiguous=False, 2025-05-07T20:33:03.5516837Z compiled=True, 2025-05-07T20:33:03.5516916Z ) 2025-05-07T20:33:03.5517140Z self = 2025-05-07T20:33:03.5517316Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:03.5517320Z 2025-05-07T20:33:03.5517401Z @given( 2025-05-07T20:33:03.5517527Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5517629Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5517749Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5517866Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5517981Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5518069Z ) 2025-05-07T20:33:03.5518317Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5518416Z def test_silu_mul_quant( 2025-05-07T20:33:03.5518503Z self, 2025-05-07T20:33:03.5518583Z T: int, 2025-05-07T20:33:03.5518665Z D: int, 2025-05-07T20:33:03.5518772Z scale_ub: Optional[float], 2025-05-07T20:33:03.5518862Z contiguous: bool, 2025-05-07T20:33:03.5518953Z compiled: bool, 2025-05-07T20:33:03.5519033Z ) -> None: 2025-05-07T20:33:03.5519128Z torch.manual_seed(2025) 2025-05-07T20:33:03.5519208Z 2025-05-07T20:33:03.5519381Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5519475Z 2025-05-07T20:33:03.5519582Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5519724Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5519815Z x = x_sign * x_clamp 2025-05-07T20:33:03.5519902Z x0 = x[:, :D] 2025-05-07T20:33:03.5519986Z x1 = x[:, D:] 2025-05-07T20:33:03.5520061Z 2025-05-07T20:33:03.5520325Z if contiguous: 2025-05-07T20:33:03.5520419Z x0 = x0.contiguous() 2025-05-07T20:33:03.5520509Z x1 = x1.contiguous() 2025-05-07T20:33:03.5520589Z 2025-05-07T20:33:03.5520721Z if scale_ub is not None: 2025-05-07T20:33:03.5520832Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5520966Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5521043Z ) 2025-05-07T20:33:03.5521126Z else: 2025-05-07T20:33:03.5521222Z scale_ub_tensor = None 2025-05-07T20:33:03.5521296Z 2025-05-07T20:33:03.5521434Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5521524Z op = silu_mul_quant 2025-05-07T20:33:03.5521610Z if compiled: 2025-05-07T20:33:03.5521719Z op = torch.compile(op) 2025-05-07T20:33:03.5521825Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5521899Z 2025-05-07T20:33:03.5522004Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5522008Z 2025-05-07T20:33:03.5522107Z moe/activation_test.py:117: 2025-05-07T20:33:03.5522244Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5522348Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5522448Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5522819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.5522913Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.5523402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5523508Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5523864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5524097Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5524439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5524533Z kernel = self.compile( 2025-05-07T20:33:03.5524923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5525098Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5525233Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5525237Z 2025-05-07T20:33:03.5525444Z self = 2025-05-07T20:33:03.5526217Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5526731Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69acf2e0>} 2025-05-07T20:33:03.5527472Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5527672Z context = 2025-05-07T20:33:03.5527676Z 2025-05-07T20:33:03.5527840Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5528102Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5528218Z module_map=module_map) 2025-05-07T20:33:03.5528382Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5528492Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5528574Z E ^ 2025-05-07T20:33:03.5529055Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5529060Z 2025-05-07T20:33:03.5529478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5529520Z 2025-05-07T20:33:03.5529626Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5529856Z self=, 2025-05-07T20:33:03.5529941Z T=4096, 2025-05-07T20:33:03.5530021Z D=5120, 2025-05-07T20:33:03.5530114Z scale_ub=1200.0, 2025-05-07T20:33:03.5530203Z contiguous=True, 2025-05-07T20:33:03.5530290Z compiled=True, 2025-05-07T20:33:03.5530374Z ) 2025-05-07T20:33:03.5530598Z self = 2025-05-07T20:33:03.5530772Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:03.5530779Z 2025-05-07T20:33:03.5530872Z @given( 2025-05-07T20:33:03.5530992Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5531100Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5531221Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5531339Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5531458Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5531535Z ) 2025-05-07T20:33:03.5531778Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5531881Z def test_silu_mul_quant( 2025-05-07T20:33:03.5531961Z self, 2025-05-07T20:33:03.5532040Z T: int, 2025-05-07T20:33:03.5532126Z D: int, 2025-05-07T20:33:03.5532227Z scale_ub: Optional[float], 2025-05-07T20:33:03.5532317Z contiguous: bool, 2025-05-07T20:33:03.5532409Z compiled: bool, 2025-05-07T20:33:03.5532489Z ) -> None: 2025-05-07T20:33:03.5532601Z torch.manual_seed(2025) 2025-05-07T20:33:03.5532675Z 2025-05-07T20:33:03.5532843Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5532925Z 2025-05-07T20:33:03.5533020Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5533145Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5533240Z x = x_sign * x_clamp 2025-05-07T20:33:03.5533321Z x0 = x[:, :D] 2025-05-07T20:33:03.5533402Z x1 = x[:, D:] 2025-05-07T20:33:03.5533483Z 2025-05-07T20:33:03.5533567Z if contiguous: 2025-05-07T20:33:03.5533660Z x0 = x0.contiguous() 2025-05-07T20:33:03.5533756Z x1 = x1.contiguous() 2025-05-07T20:33:03.5533830Z 2025-05-07T20:33:03.5533923Z if scale_ub is not None: 2025-05-07T20:33:03.5534037Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5534170Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5534254Z ) 2025-05-07T20:33:03.5534340Z else: 2025-05-07T20:33:03.5534435Z scale_ub_tensor = None 2025-05-07T20:33:03.5534520Z 2025-05-07T20:33:03.5534651Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5534746Z op = silu_mul_quant 2025-05-07T20:33:03.5534841Z if compiled: 2025-05-07T20:33:03.5534944Z op = torch.compile(op) 2025-05-07T20:33:03.5535050Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5535131Z 2025-05-07T20:33:03.5535224Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5535228Z 2025-05-07T20:33:03.5535334Z moe/activation_test.py:117: 2025-05-07T20:33:03.5535462Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5535563Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5535670Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5536210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.5536345Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.5536840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:03.5536984Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:03.5537349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:03.5537571Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:03.5537907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:03.5538008Z     kernel = self.compile(
2025-05-07T20:33:03.5538387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:03.5538560Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:03.5538702Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:03.5538911Z self =
2025-05-07T20:33:03.5539693Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:03.5540199Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a6948c860>}
2025-05-07T20:33:03.5540946Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:03.5541141Z context =
2025-05-07T20:33:03.5541314Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:03.5541585Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:03.5541696Z                            module_map=module_map)
2025-05-07T20:33:03.5541858Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:03.5541968Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:03.5542047Z E   ^
2025-05-07T20:33:03.5542409Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:03.5542825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:03.5542936Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:03.5543172Z     self=,
2025-05-07T20:33:03.5543254Z     T=128,
2025-05-07T20:33:03.5543342Z     D=5120,
2025-05-07T20:33:03.5543432Z     scale_ub=1200.0,
2025-05-07T20:33:03.5543524Z     contiguous=False,
2025-05-07T20:33:03.5543617Z     compiled=True,
2025-05-07T20:33:03.5543694Z )
2025-05-07T20:33:03.5543913Z self =
2025-05-07T20:33:03.5544091Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:33:03.5544178Z     @given(
2025-05-07T20:33:03.5544299Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:03.5544407Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:03.5544524Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:03.5544650Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:03.5544765Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:03.5544895Z     )
2025-05-07T20:33:03.5545221Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:03.5545318Z     def test_silu_mul_quant(
2025-05-07T20:33:03.5545399Z         self,
2025-05-07T20:33:03.5545525Z         T: int,
2025-05-07T20:33:03.5545607Z         D: int,
2025-05-07T20:33:03.5545710Z         scale_ub: Optional[float],
2025-05-07T20:33:03.5545809Z         contiguous: bool,
2025-05-07T20:33:03.5545897Z         compiled: bool,
2025-05-07T20:33:03.5545977Z     ) -> None:
2025-05-07T20:33:03.5546077Z         torch.manual_seed(2025)
2025-05-07T20:33:03.5546327Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:03.5546494Z         x_sign = torch.sign(x)
2025-05-07T20:33:03.5546623Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:03.5546713Z         x = x_sign * x_clamp
2025-05-07T20:33:03.5546794Z         x0 = x[:, :D]
2025-05-07T20:33:03.5546889Z         x1 = x[:, D:]
2025-05-07T20:33:03.5547048Z         if contiguous:
2025-05-07T20:33:03.5547148Z             x0 = x0.contiguous()
2025-05-07T20:33:03.5547237Z             x1 = x1.contiguous()
2025-05-07T20:33:03.5547411Z         if scale_ub is not None:
2025-05-07T20:33:03.5547517Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:03.5547652Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:03.5547736Z             )
2025-05-07T20:33:03.5547815Z         else:
2025-05-07T20:33:03.5547916Z             scale_ub_tensor = None
2025-05-07T20:33:03.5548120Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:03.5548220Z             op = silu_mul_quant
2025-05-07T20:33:03.5548307Z             if compiled:
2025-05-07T20:33:03.5548408Z                 op = torch.compile(op)
2025-05-07T20:33:03.5548519Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:03.5548695Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:03.5548806Z moe/activation_test.py:117:
2025-05-07T20:33:03.5548936Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:03.5549043Z moe/activation_test.py:115: in fn
2025-05-07T20:33:03.5549144Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:03.5549537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:03.5549656Z     return fn(*args, **kwargs)
2025-05-07T20:33:03.5550149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:03.5550246Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:03.5550610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:03.5550838Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:03.5551180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:03.5551278Z     kernel = self.compile(
2025-05-07T20:33:03.5551659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:03.5551838Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:03.5551965Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:03.5552178Z self =
2025-05-07T20:33:03.5553075Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:03.5553616Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a6948d580>}
2025-05-07T20:33:03.5554406Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:03.5554596Z context =
2025-05-07T20:33:03.5554773Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:03.5555035Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:03.5555144Z                            module_map=module_map)
2025-05-07T20:33:03.5555314Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:03.5555421Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:03.5555503Z E   ^
2025-05-07T20:33:03.5556020Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:03.5556442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
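For context, the op under test fuses a SiLU-gated multiply with fp8 quantization. A minimal reference sketch of the computation, inferred from the test's inputs and outputs (silu_mul_quant's actual kernel math in fbgemm_gpu may differ; the helper name and the scale_ub handling are assumptions):

# Hedged reference sketch -- not FBGEMM's kernel. Assumes y = silu(x0) * x1,
# quantized per-tensor to float8_e4m3fn (Triton's fp8e4nv), with the amax
# used for the scale optionally clamped by scale_ub.
from typing import Optional, Tuple

import torch

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    amax = y.abs().amax()
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub.float().squeeze())
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    scale = (amax / fp8_max).clamp(min=1e-12)  # guard against all-zero input
    y_fp8 = (y / scale).to(torch.float8_e4m3fn)
    return y_fp8, scale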
2025-05-07T20:33:03.5556557Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- same CompilationError: fp8e4nv not supported
2025-05-07T20:33:03.5570356Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -- same CompilationError
2025-05-07T20:33:03.5601128Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -- same CompilationError
2025-05-07T20:33:03.5632766Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -- same CompilationError
2025-05-07T20:33:03.5665298Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -- same CompilationError
2025-05-07T20:33:03.5698430Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -- same CompilationError
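Every example above fails at the same point: Triton cannot lower the fp8e4nv (float8_e4m3fn) type on this GPU, and reports only fp8e4b15 and fp8e5 as supported, which indicates a compute capability below sm_89. A sketch of a capability gate that would skip these cases instead of failing them (the helper and threshold are illustrative, not the guard FBGEMM actually uses):

# Hypothetical skip guard, not the repo's actual code: fp8e4nv generally
# requires an NVIDIA GPU with compute capability >= 8.9 (Ada/Hopper).
import unittest

import torch

def device_supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not device_supports_fp8e4nv(), "fp8e4nv requires sm_89+")
class ActivationFP8Tests(unittest.TestCase):
    ...  # fp8 property tests would live here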
2025-05-07T20:33:03.5730959Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:03.5731378Z     self=,
2025-05-07T20:33:03.5731787Z     T=16384,
2025-05-07T20:33:03.5731984Z     D=5120,
2025-05-07T20:33:03.5732185Z     scale_ub=None,
2025-05-07T20:33:03.5732410Z     contiguous=False,
2025-05-07T20:33:03.5732637Z     compiled=False,
2025-05-07T20:33:03.5732851Z )
2025-05-07T20:33:03.5733689Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:33:03.5746582Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:03.5747147Z         x_sign = torch.sign(x)
2025-05-07T20:33:03.5747437Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:03.5749488Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:03.5751498Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:03.5751833Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB, 28.44 MiB free
2025-05-07T20:33:03.5765968Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB, 140.44 MiB free
2025-05-07T20:33:03.5778812Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB, 28.44 MiB free
2025-05-07T20:33:03.5792125Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign): tried to allocate 56.00 MiB, 28.44 MiB free
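The OOM failures are a secondary effect: allocations accumulate across failing examples until the 22.07 GiB device is nearly full, at which point even 56 MiB requests fail. The error message itself suggests one mitigation; a small sketch of that plus an explicit cache flush between examples (the env var must be set before the first CUDA allocation to take effect; the helper is illustrative, not part of this test suite):

# Sketch of the allocator hint from the error message, plus a manual flush.
# PYTORCH_CUDA_ALLOC_CONF is only honored if set before CUDA initializes.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import gc

import torch

def free_cached_cuda_memory() -> None:
    gc.collect()              # drop dead Python references to tensors first
    torch.cuda.empty_cache()  # return cached allocator blocks to the driver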
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:03.5801293Z 2025-05-07T20:33:03.5801413Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:03.5801424Z 2025-05-07T20:33:03.5801527Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5801752Z self=, 2025-05-07T20:33:03.5801837Z T=1, 2025-05-07T20:33:03.5801916Z D=7168, 2025-05-07T20:33:03.5802001Z scale_ub=1200.0, 2025-05-07T20:33:03.5802093Z contiguous=True, 2025-05-07T20:33:03.5802179Z compiled=False, 2025-05-07T20:33:03.5802255Z ) 2025-05-07T20:33:03.5802482Z self = 2025-05-07T20:33:03.5802653Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:03.5802658Z 2025-05-07T20:33:03.5802740Z @given( 2025-05-07T20:33:03.5802866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5802966Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5803086Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5803204Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5803320Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5803402Z ) 2025-05-07T20:33:03.5803645Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5803739Z def test_silu_mul_quant( 2025-05-07T20:33:03.5803826Z self, 2025-05-07T20:33:03.5803905Z T: int, 2025-05-07T20:33:03.5803983Z D: int, 2025-05-07T20:33:03.5804099Z scale_ub: Optional[float], 2025-05-07T20:33:03.5804192Z contiguous: bool, 2025-05-07T20:33:03.5804288Z compiled: bool, 2025-05-07T20:33:03.5804369Z ) -> None: 2025-05-07T20:33:03.5804467Z torch.manual_seed(2025) 2025-05-07T20:33:03.5804556Z 2025-05-07T20:33:03.5804727Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5804808Z 2025-05-07T20:33:03.5804907Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5805031Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5805125Z x = x_sign * x_clamp 2025-05-07T20:33:03.5805217Z x0 = x[:, :D] 2025-05-07T20:33:03.5805300Z x1 = x[:, D:] 2025-05-07T20:33:03.5805376Z 2025-05-07T20:33:03.5805468Z if contiguous: 2025-05-07T20:33:03.5805562Z x0 = x0.contiguous() 2025-05-07T20:33:03.5805654Z x1 = x1.contiguous() 2025-05-07T20:33:03.5805737Z 2025-05-07T20:33:03.5805832Z if scale_ub is not None: 2025-05-07T20:33:03.5806040Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5806183Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5806262Z ) 2025-05-07T20:33:03.5806386Z else: 2025-05-07T20:33:03.5806482Z scale_ub_tensor = None 2025-05-07T20:33:03.5806558Z 2025-05-07T20:33:03.5806698Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5806791Z op = silu_mul_quant 2025-05-07T20:33:03.5806879Z if compiled: 2025-05-07T20:33:03.5806987Z op = torch.compile(op) 2025-05-07T20:33:03.5807095Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5807173Z 2025-05-07T20:33:03.5807273Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5807277Z 2025-05-07T20:33:03.5807378Z moe/activation_test.py:117: 2025-05-07T20:33:03.5807520Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5807623Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5807729Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5808287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5808390Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5808752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5808983Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5809324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5809427Z kernel = self.compile( 2025-05-07T20:33:03.5809812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5809989Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5810133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5810143Z 2025-05-07T20:33:03.5810350Z self = 2025-05-07T20:33:03.5811145Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5811650Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a692dab60>} 2025-05-07T20:33:03.5812400Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5812604Z context = 2025-05-07T20:33:03.5812610Z 2025-05-07T20:33:03.5812779Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5813057Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5813169Z module_map=module_map) 2025-05-07T20:33:03.5813332Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5813439Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5813519Z E ^ 2025-05-07T20:33:03.5813885Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5813889Z 2025-05-07T20:33:03.5814304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5814308Z 2025-05-07T20:33:03.5814416Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5814722Z self=, 2025-05-07T20:33:03.5814842Z T=128, 2025-05-07T20:33:03.5814924Z D=5120, 2025-05-07T20:33:03.5815018Z scale_ub=None, 2025-05-07T20:33:03.5815111Z contiguous=True, 2025-05-07T20:33:03.5815244Z compiled=False, 2025-05-07T20:33:03.5815321Z ) 2025-05-07T20:33:03.5815538Z self = 2025-05-07T20:33:03.5815716Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:03.5815721Z 2025-05-07T20:33:03.5815799Z @given( 2025-05-07T20:33:03.5815919Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5816025Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5816141Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5816258Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5816378Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5816455Z ) 2025-05-07T20:33:03.5816766Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5816863Z def test_silu_mul_quant( 2025-05-07T20:33:03.5816941Z self, 2025-05-07T20:33:03.5817029Z T: int, 2025-05-07T20:33:03.5817108Z D: int, 2025-05-07T20:33:03.5817211Z scale_ub: Optional[float], 2025-05-07T20:33:03.5817309Z contiguous: bool, 2025-05-07T20:33:03.5817396Z compiled: bool, 2025-05-07T20:33:03.5817477Z ) -> None: 2025-05-07T20:33:03.5817583Z torch.manual_seed(2025) 2025-05-07T20:33:03.5817658Z 2025-05-07T20:33:03.5817828Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5817911Z 2025-05-07T20:33:03.5818006Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5818137Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5818229Z x = x_sign * x_clamp 2025-05-07T20:33:03.5818312Z x0 = x[:, :D] 2025-05-07T20:33:03.5818407Z x1 = x[:, D:] 2025-05-07T20:33:03.5818485Z 2025-05-07T20:33:03.5818573Z if contiguous: 2025-05-07T20:33:03.5818675Z x0 = x0.contiguous() 2025-05-07T20:33:03.5818766Z x1 = x1.contiguous() 2025-05-07T20:33:03.5818845Z 2025-05-07T20:33:03.5818947Z if scale_ub is not None: 2025-05-07T20:33:03.5819055Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5819192Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5819279Z ) 2025-05-07T20:33:03.5819357Z else: 2025-05-07T20:33:03.5819454Z scale_ub_tensor = None 2025-05-07T20:33:03.5819537Z 2025-05-07T20:33:03.5819668Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5819767Z op = silu_mul_quant 2025-05-07T20:33:03.5819858Z if compiled: 2025-05-07T20:33:03.5819959Z op = torch.compile(op) 2025-05-07T20:33:03.5820073Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5820155Z 2025-05-07T20:33:03.5820248Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5820255Z 2025-05-07T20:33:03.5820363Z moe/activation_test.py:117: 2025-05-07T20:33:03.5820496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5820602Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5820713Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5821212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5821317Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5821676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5821901Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5822293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5822428Z kernel = self.compile( 2025-05-07T20:33:03.5822820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5823036Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5823168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5823172Z 2025-05-07T20:33:03.5823386Z self = 2025-05-07T20:33:03.5824172Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5824686Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a692dbc40>} 2025-05-07T20:33:03.5825481Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5825677Z context = 2025-05-07T20:33:03.5825682Z 2025-05-07T20:33:03.5825858Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5826125Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5826243Z module_map=module_map) 2025-05-07T20:33:03.5826410Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5826516Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5826605Z E ^ 2025-05-07T20:33:03.5826966Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5826975Z 2025-05-07T20:33:03.5827390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5827404Z 2025-05-07T20:33:03.5827510Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5827738Z self=, 2025-05-07T20:33:03.5827829Z T=128, 2025-05-07T20:33:03.5827910Z D=7168, 2025-05-07T20:33:03.5827996Z scale_ub=None, 2025-05-07T20:33:03.5828093Z contiguous=True, 2025-05-07T20:33:03.5828184Z compiled=False, 2025-05-07T20:33:03.5828262Z ) 2025-05-07T20:33:03.5828487Z self = 2025-05-07T20:33:03.5828658Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:03.5828662Z 2025-05-07T20:33:03.5828748Z @given( 2025-05-07T20:33:03.5828868Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5828975Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5829103Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5829226Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5829344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5829428Z ) 2025-05-07T20:33:03.5829696Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5829801Z def test_silu_mul_quant( 2025-05-07T20:33:03.5829908Z self, 2025-05-07T20:33:03.5829991Z T: int, 2025-05-07T20:33:03.5830071Z D: int, 2025-05-07T20:33:03.5830180Z scale_ub: Optional[float], 2025-05-07T20:33:03.5830272Z contiguous: bool, 2025-05-07T20:33:03.5830367Z compiled: bool, 2025-05-07T20:33:03.5830450Z ) -> None: 2025-05-07T20:33:03.5830548Z torch.manual_seed(2025) 2025-05-07T20:33:03.5830633Z 2025-05-07T20:33:03.5830853Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5830995Z 2025-05-07T20:33:03.5831098Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5831231Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5831365Z x = x_sign * x_clamp 2025-05-07T20:33:03.5831455Z x0 = x[:, :D] 2025-05-07T20:33:03.5831541Z x1 = x[:, D:] 2025-05-07T20:33:03.5831618Z 2025-05-07T20:33:03.5831712Z if contiguous: 2025-05-07T20:33:03.5831807Z x0 = x0.contiguous() 2025-05-07T20:33:03.5831905Z x1 = x1.contiguous() 2025-05-07T20:33:03.5831981Z 2025-05-07T20:33:03.5832075Z if scale_ub is not None: 2025-05-07T20:33:03.5832187Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5832328Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5832408Z ) 2025-05-07T20:33:03.5832493Z else: 2025-05-07T20:33:03.5832593Z scale_ub_tensor = None 2025-05-07T20:33:03.5832669Z 2025-05-07T20:33:03.5832812Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5832956Z op = silu_mul_quant 2025-05-07T20:33:03.5833045Z if compiled: 2025-05-07T20:33:03.5833158Z op = torch.compile(op) 2025-05-07T20:33:03.5833266Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5833350Z 2025-05-07T20:33:03.5833444Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5833448Z 2025-05-07T20:33:03.5833548Z moe/activation_test.py:117: 2025-05-07T20:33:03.5833685Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5833788Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5833895Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5834398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5834496Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5834873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5835097Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5835440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5835543Z kernel = self.compile( 2025-05-07T20:33:03.5836083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5836264Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5836402Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5836407Z 2025-05-07T20:33:03.5836613Z self = 2025-05-07T20:33:03.5837406Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5837913Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69074ae0>} 2025-05-07T20:33:03.5838668Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5838861Z context = 2025-05-07T20:33:03.5838865Z 2025-05-07T20:33:03.5839032Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5839303Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5839415Z module_map=module_map) 2025-05-07T20:33:03.5839675Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5839789Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5839873Z E ^ 2025-05-07T20:33:03.5840275Z E ValueError("type fp8e4nv not supported in this architecture. 
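The failures above come from Triton rejecting the fp8e4nv (e4m3) element type at kernel-compile time: fp8e4nv requires an NVIDIA GPU with compute capability 8.9 or newer, while the A10G in a linux.g5.4xlarge runner is compute capability 8.6. A minimal sketch of a capability guard a test like this could use follows; the helper name and the unittest wiring are illustrative assumptions, not FBGEMM's actual code:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Triton only exposes fp8e4nv (e4m3) on NVIDIA GPUs with compute
        # capability >= 8.9 (Ada / Hopper); the A10G is sm_86.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    # Hypothetical guard: skip the fp8 tests instead of failing on sm_86.
    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class ActivationFP8Tests(unittest.TestCase):
        ...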
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5840280Z 2025-05-07T20:33:03.5840693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5840697Z 2025-05-07T20:33:03.5840806Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5841038Z self=, 2025-05-07T20:33:03.5841119Z T=2048, 2025-05-07T20:33:03.5841207Z D=7168, 2025-05-07T20:33:03.5841299Z scale_ub=1200.0, 2025-05-07T20:33:03.5841392Z contiguous=True, 2025-05-07T20:33:03.5841490Z compiled=False, 2025-05-07T20:33:03.5841569Z ) 2025-05-07T20:33:03.5841799Z self = 2025-05-07T20:33:03.5842055Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:03.5842061Z 2025-05-07T20:33:03.5842144Z @given( 2025-05-07T20:33:03.5842269Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5842377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5842496Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5842625Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5842742Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5842821Z ) 2025-05-07T20:33:03.5843078Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5843177Z def test_silu_mul_quant( 2025-05-07T20:33:03.5843257Z self, 2025-05-07T20:33:03.5843346Z T: int, 2025-05-07T20:33:03.5843426Z D: int, 2025-05-07T20:33:03.5843531Z scale_ub: Optional[float], 2025-05-07T20:33:03.5843633Z contiguous: bool, 2025-05-07T20:33:03.5843728Z compiled: bool, 2025-05-07T20:33:03.5843810Z ) -> None: 2025-05-07T20:33:03.5843915Z torch.manual_seed(2025) 2025-05-07T20:33:03.5843996Z 2025-05-07T20:33:03.5844170Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5845966Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:03.5845971Z 2025-05-07T20:33:03.5846102Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:03.5846106Z 2025-05-07T20:33:03.5846214Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5846438Z self=, 2025-05-07T20:33:03.5846525Z T=1, 2025-05-07T20:33:03.5846608Z D=5120, 2025-05-07T20:33:03.5846694Z scale_ub=1200.0, 2025-05-07T20:33:03.5846787Z contiguous=True, 2025-05-07T20:33:03.5846873Z compiled=False, 2025-05-07T20:33:03.5846949Z ) 2025-05-07T20:33:03.5847175Z self = 2025-05-07T20:33:03.5847340Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:03.5847345Z 2025-05-07T20:33:03.5847432Z @given( 2025-05-07T20:33:03.5847553Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5847652Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5847823Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5847984Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5848100Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5848184Z ) 2025-05-07T20:33:03.5848472Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5848567Z def test_silu_mul_quant( 2025-05-07T20:33:03.5848650Z self, 2025-05-07T20:33:03.5848729Z T: int, 2025-05-07T20:33:03.5848818Z D: int, 2025-05-07T20:33:03.5848918Z scale_ub: Optional[float], 2025-05-07T20:33:03.5849007Z contiguous: bool, 2025-05-07T20:33:03.5849099Z compiled: bool, 2025-05-07T20:33:03.5849179Z ) -> None: 2025-05-07T20:33:03.5849276Z torch.manual_seed(2025) 2025-05-07T20:33:03.5849357Z 2025-05-07T20:33:03.5849524Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5849600Z 2025-05-07T20:33:03.5849706Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5849836Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5849976Z x = x_sign * x_clamp 2025-05-07T20:33:03.5850069Z x0 = x[:, :D] 2025-05-07T20:33:03.5850155Z x1 = x[:, D:] 2025-05-07T20:33:03.5850232Z 2025-05-07T20:33:03.5850327Z if contiguous: 2025-05-07T20:33:03.5850424Z x0 = x0.contiguous() 2025-05-07T20:33:03.5850517Z x1 = x1.contiguous() 2025-05-07T20:33:03.5850602Z 2025-05-07T20:33:03.5850695Z if scale_ub is not None: 2025-05-07T20:33:03.5850809Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5850945Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5851024Z ) 2025-05-07T20:33:03.5851111Z else: 2025-05-07T20:33:03.5851210Z scale_ub_tensor = None 2025-05-07T20:33:03.5851286Z 2025-05-07T20:33:03.5851422Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5851517Z op = silu_mul_quant 2025-05-07T20:33:03.5851607Z if compiled: 2025-05-07T20:33:03.5851717Z op = torch.compile(op) 2025-05-07T20:33:03.5851825Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5851903Z 2025-05-07T20:33:03.5852002Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5852007Z 2025-05-07T20:33:03.5852106Z moe/activation_test.py:117: 2025-05-07T20:33:03.5852242Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5852345Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5852446Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5852949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5853048Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5853412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5853645Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5853991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5854095Z kernel = self.compile( 2025-05-07T20:33:03.5854479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5854656Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5854791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5854796Z 2025-05-07T20:33:03.5855000Z self = 2025-05-07T20:33:03.5855831Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5856476Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a690760c0>} 2025-05-07T20:33:03.5857265Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5857466Z context = 2025-05-07T20:33:03.5857470Z 2025-05-07T20:33:03.5857637Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5857912Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5858022Z module_map=module_map) 2025-05-07T20:33:03.5858186Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5858304Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5858383Z E ^ 2025-05-07T20:33:03.5858820Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5858828Z 2025-05-07T20:33:03.5859241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5859246Z 2025-05-07T20:33:03.5859352Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5859600Z self=, 2025-05-07T20:33:03.5859688Z T=2048, 2025-05-07T20:33:03.5859782Z D=5120, 2025-05-07T20:33:03.5859882Z scale_ub=None, 2025-05-07T20:33:03.5859969Z contiguous=True, 2025-05-07T20:33:03.5860061Z compiled=False, 2025-05-07T20:33:03.5860138Z ) 2025-05-07T20:33:03.5860357Z self = 2025-05-07T20:33:03.5860542Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:03.5860549Z 2025-05-07T20:33:03.5860631Z @given( 2025-05-07T20:33:03.5860751Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5860860Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5860978Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5861098Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5861219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5861296Z ) 2025-05-07T20:33:03.5861549Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5861645Z def test_silu_mul_quant( 2025-05-07T20:33:03.5861728Z self, 2025-05-07T20:33:03.5861814Z T: int, 2025-05-07T20:33:03.5861894Z D: int, 2025-05-07T20:33:03.5861994Z scale_ub: Optional[float], 2025-05-07T20:33:03.5862096Z contiguous: bool, 2025-05-07T20:33:03.5862187Z compiled: bool, 2025-05-07T20:33:03.5862271Z ) -> None: 2025-05-07T20:33:03.5862378Z torch.manual_seed(2025) 2025-05-07T20:33:03.5862455Z 2025-05-07T20:33:03.5862628Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5862717Z 2025-05-07T20:33:03.5862812Z > x_sign = torch.sign(x) 2025-05-07T20:33:03.5864611Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
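Note that the "free" figure in these OutOfMemoryError messages shrinks as the run proceeds (26.44 MiB free here, 4.44 MiB by the last examples), which suggests tensors from earlier failed examples are still alive between Hypothesis examples. A sketch of a per-test cleanup that would release cached blocks between examples; the tearDown placement is an assumption about the test class, not FBGEMM's actual code:

    import gc
    import unittest

    import torch


    class ActivationTests(unittest.TestCase):
        def tearDown(self) -> None:
            # Drop dangling Python references from the failed example first,
            # then return cached CUDA blocks to the driver so the next
            # Hypothesis example starts from a clean allocator state.
            gc.collect()
            torch.cuda.empty_cache()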
The next ten examples all hit the same OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], ...)); only the parameters and the requested allocation size differ (26.44 MiB free on GPU 0 throughout):

    T      D     scale_ub  contiguous  compiled  requested
    16384  5120  None      True        False     320.00 MiB
    4096   5120  None      True        False      80.00 MiB
    2048   5120  None      False       False      40.00 MiB
    4096   7168  None      True        True      112.00 MiB
    2048   5120  1200.0    False       False      40.00 MiB
    4096   7168  1200.0    True        False     112.00 MiB
    16384  7168  None      False       True      448.00 MiB
    4096   7168  None      True        False     112.00 MiB
    16384  7168  None      True        False     448.00 MiB
    16384  7168  1200.0    True        False     448.00 MiB
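The requested sizes in this table are exactly the footprint of the first allocation in the test, x = torch.randn([T, 2 * D], ...) in bfloat16 at 2 bytes per element. A quick check of that arithmetic:

    def randn_mib(T: int, D: int) -> float:
        # x holds T * (2 * D) bfloat16 elements at 2 bytes each.
        return T * 2 * D * 2 / 2**20

    assert randn_mib(16384, 5120) == 320.0  # matches "Tried to allocate 320.00 MiB"
    assert randn_mib(4096, 5120) == 80.0
    assert randn_mib(2048, 5120) == 40.0
    assert randn_mib(4096, 7168) == 112.0
    assert randn_mib(16384, 7168) == 448.0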
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
moe/activation_test.py:117: CompilationError (same fp8e4nv failure as above)

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. (26.44 MiB free; 21.74 GiB allocated by PyTorch)

moe/activation_test.py:92: OutOfMemoryError
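The allocator hint repeated in these messages, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, only takes effect if it is set before the process makes its first CUDA allocation, for example in the job's environment block. A sketch, assuming a plain Python entry point:

    import os

    # Must be set before the CUDA caching allocator is initialized,
    # i.e. before the first CUDA allocation anywhere in the process.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # noqa: E402

    x = torch.randn(1024, device="cuda")  # allocated from expandable segments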
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
(the compiled path adds the torch._dynamo frame but otherwise fails identically)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
moe/activation_test.py:95: OutOfMemoryError (tried to allocate 20.00 MiB; 4.44 MiB free)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 20.00 MiB; 4.44 MiB free)

=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
See " 2025-05-07T20:33:03.5997734Z 2025-05-07T20:33:03.5998048Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:03.5998265Z ================= 1 failed, 1 deselected, 3 warnings in 15.28s ================= 2025-05-07T20:33:05.2145062Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:05.2762711Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:33:05.2762974Z 2025-05-07T20:33:07.2778566Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:09.4357853Z ============================= test session starts ============================== 2025-05-07T20:33:09.4358684Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:09.4359269Z cachedir: .pytest_cache 2025-05-07T20:33:09.4359973Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:09.4360954Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:09.4361516Z plugins: hypothesis-6.131.14 2025-05-07T20:33:11.0545777Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:11.1633719Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:11.1634299Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:11.1634601Z 2025-05-07T20:33:13.5127578Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.5128605Z self=, 2025-05-07T20:33:13.5129112Z T=1, 2025-05-07T20:33:13.5129470Z D=5120, 2025-05-07T20:33:13.5129768Z scale_ub=None, 2025-05-07T20:33:13.5130074Z contiguous=True, 2025-05-07T20:33:13.5130444Z compiled=True, 2025-05-07T20:33:13.5130744Z ) 2025-05-07T20:33:13.5131147Z self = 2025-05-07T20:33:13.5132139Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:13.5132496Z 2025-05-07T20:33:13.5132617Z @given( 2025-05-07T20:33:13.5132941Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.5133473Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.5133896Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.5134304Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.5134755Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.5135155Z ) 2025-05-07T20:33:13.5135582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.5136204Z def test_silu_mul_quant( 2025-05-07T20:33:13.5136505Z self, 2025-05-07T20:33:13.5136778Z T: int, 2025-05-07T20:33:13.5137148Z D: int, 2025-05-07T20:33:13.5137426Z scale_ub: Optional[float], 2025-05-07T20:33:13.5137777Z contiguous: bool, 2025-05-07T20:33:13.5138202Z compiled: bool, 2025-05-07T20:33:13.5138488Z ) -> None: 2025-05-07T20:33:13.5138870Z torch.manual_seed(2025) 2025-05-07T20:33:13.5139291Z 2025-05-07T20:33:13.5139625Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.5140073Z 2025-05-07T20:33:13.5140436Z x_sign = torch.sign(x) 2025-05-07T20:33:13.5140816Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:13.5141193Z x = x_sign * x_clamp 2025-05-07T20:33:13.5141595Z x0 = x[:, :D] 2025-05-07T20:33:13.5141902Z x1 = x[:, D:] 2025-05-07T20:33:13.5142176Z 2025-05-07T20:33:13.5142514Z if contiguous: 2025-05-07T20:33:13.5142861Z x0 = x0.contiguous() 2025-05-07T20:33:13.5143185Z x1 = x1.contiguous() 2025-05-07T20:33:13.5150121Z 2025-05-07T20:33:13.5150340Z if scale_ub is not None: 2025-05-07T20:33:13.5150621Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.5150975Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.5151297Z ) 2025-05-07T20:33:13.5151494Z else: 2025-05-07T20:33:13.5151714Z scale_ub_tensor = None 2025-05-07T20:33:13.5151976Z 2025-05-07T20:33:13.5152220Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.5152536Z op = silu_mul_quant 2025-05-07T20:33:13.5152794Z if compiled: 2025-05-07T20:33:13.5153053Z op = torch.compile(op) 2025-05-07T20:33:13.5153348Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.5153629Z 2025-05-07T20:33:13.5153831Z y_fp8, y_scale = fn() 2025-05-07T20:33:13.5154114Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:13.5154412Z 2025-05-07T20:33:13.5154656Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.5154990Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:13.5155289Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:13.5155616Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:13.5156089Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:13.5156406Z 2025-05-07T20:33:13.5156615Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:13.5156810Z 2025-05-07T20:33:13.5156928Z moe/activation_test.py:126: 2025-05-07T20:33:13.5157222Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.5157569Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:13.5157900Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:13.5158690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:13.5159443Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:13.5159992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.5161576Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.5162267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:13.5163043Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:13.5163781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:13.5164420Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:13.5165015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:13.5165843Z fn() 2025-05-07T20:33:13.5166355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:13.5166933Z self.fn.run( 2025-05-07T20:33:13.5167412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.5168049Z kernel = self.compile( 2025-05-07T20:33:13.5168593Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.5169242Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.5169647Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.5169878Z 2025-05-07T20:33:13.5170091Z self = 2025-05-07T20:33:13.5171172Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.5172556Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f3351c60>} 2025-05-07T20:33:13.5173901Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.5174932Z context = 2025-05-07T20:33:13.5175222Z 2025-05-07T20:33:13.5175395Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.5175915Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.5176394Z module_map=module_map) 2025-05-07T20:33:13.5176762Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.5177128Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:13.5177397Z E ^ 2025-05-07T20:33:13.5177873Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.5178330Z 2025-05-07T20:33:13.5178751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.5179265Z 2025-05-07T20:33:13.5179371Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.5179787Z self=, 2025-05-07T20:33:13.5180195Z T=2048, 2025-05-07T20:33:13.5180391Z D=5120, 2025-05-07T20:33:13.5180581Z scale_ub=1200.0, 2025-05-07T20:33:13.5180807Z contiguous=True, 2025-05-07T20:33:13.5181034Z compiled=False, 2025-05-07T20:33:13.5181237Z ) 2025-05-07T20:33:14.2520866Z self = 2025-05-07T20:33:14.2521653Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:14.2522002Z 2025-05-07T20:33:14.2522094Z @given( 2025-05-07T20:33:14.2522728Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.2523061Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.2523380Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.2523796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.2524132Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.2524430Z ) 2025-05-07T20:33:14.2524776Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.2525228Z def test_silu_mul_quant( 2025-05-07T20:33:14.2525480Z self, 2025-05-07T20:33:14.2525679Z T: int, 2025-05-07T20:33:14.2525886Z D: int, 2025-05-07T20:33:14.2526119Z scale_ub: Optional[float], 2025-05-07T20:33:14.2526389Z contiguous: bool, 2025-05-07T20:33:14.2526635Z compiled: bool, 2025-05-07T20:33:14.2526868Z ) -> None: 2025-05-07T20:33:14.2527091Z torch.manual_seed(2025) 2025-05-07T20:33:14.2527349Z 2025-05-07T20:33:14.2527726Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.2528081Z 2025-05-07T20:33:14.2528277Z x_sign = torch.sign(x) 2025-05-07T20:33:14.2528587Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.2528911Z x = x_sign * x_clamp 2025-05-07T20:33:14.2529152Z x0 = x[:, :D] 
2025-05-07T20:33:14.2529379Z x1 = x[:, D:] 2025-05-07T20:33:14.2529594Z 2025-05-07T20:33:14.2529780Z if contiguous: 2025-05-07T20:33:14.2530024Z x0 = x0.contiguous() 2025-05-07T20:33:14.2530293Z x1 = x1.contiguous() 2025-05-07T20:33:14.2530535Z 2025-05-07T20:33:14.2530735Z if scale_ub is not None: 2025-05-07T20:33:14.2531018Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.2531356Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.2531674Z ) 2025-05-07T20:33:14.2531872Z else: 2025-05-07T20:33:14.2532083Z scale_ub_tensor = None 2025-05-07T20:33:14.2532338Z 2025-05-07T20:33:14.2532579Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.2532907Z op = silu_mul_quant 2025-05-07T20:33:14.2533155Z if compiled: 2025-05-07T20:33:14.2533405Z op = torch.compile(op) 2025-05-07T20:33:14.2533702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.2533974Z 2025-05-07T20:33:14.2534174Z > y_fp8, y_scale = fn() 2025-05-07T20:33:14.2534337Z 2025-05-07T20:33:14.2534452Z moe/activation_test.py:117: 2025-05-07T20:33:14.2534748Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.2535091Z moe/activation_test.py:115: in fn 2025-05-07T20:33:14.2535374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.2536062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:14.2536763Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:14.2537302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.2537982Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.2538634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.2539166Z kernel = self.compile( 2025-05-07T20:33:14.2539705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.2540359Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.2540760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.2540999Z 2025-05-07T20:33:14.2541202Z self = 2025-05-07T20:33:14.2542339Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.2543837Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f31a8220>} 2025-05-07T20:33:14.2545180Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.2546204Z context = 2025-05-07T20:33:14.2546488Z 2025-05-07T20:33:14.2546660Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.2547185Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.2547650Z module_map=module_map) 2025-05-07T20:33:14.2548058Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.2548419Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:14.2548676Z E ^ 2025-05-07T20:33:14.2549141Z E ValueError("type fp8e4nv not supported in this architecture. 
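Annotation: this CompilationError, whose supported-dtype list continues on the next log line, is the underlying failure for every example in this rerun. The linux.g5.4xlarge runner carries an A10G (compute capability sm_86), and Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype on that architecture, leaving only fp8e5 (E5M2) and fp8e4b15. A hedged guard sketch; the (8, 9) cutoff is inferred from this error rather than confirmed against Triton's source, and requires_fp8e4nv is a hypothetical marker name, not something defined in activation_test.py:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Assumed cutoff: fp8e4nv lowering appears to need sm_89 (Ada) or
        # newer; the A10G here reports (8, 6) and only gets fp8e5/fp8e4b15.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical skip marker for the fp8 rowwise-quantization tests.
    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="Triton fp8e4nv (E4M3) is unavailable below sm_89",
    )
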
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.2549592Z 2025-05-07T20:33:14.2550010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.2550516Z 2025-05-07T20:33:14.2550626Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.2551032Z self=, 2025-05-07T20:33:14.2551438Z T=2048, 2025-05-07T20:33:14.2551632Z D=5120, 2025-05-07T20:33:14.2551822Z scale_ub=1200.0, 2025-05-07T20:33:14.2552047Z contiguous=True, 2025-05-07T20:33:14.2552280Z compiled=True, 2025-05-07T20:33:14.2552486Z ) 2025-05-07T20:33:14.2552809Z self = 2025-05-07T20:33:14.2553303Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:14.2553575Z 2025-05-07T20:33:14.2553662Z @given( 2025-05-07T20:33:14.2553889Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.2554203Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.2554516Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.2554843Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.2555174Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.2555462Z ) 2025-05-07T20:33:14.2555913Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.2556356Z def test_silu_mul_quant( 2025-05-07T20:33:14.2556606Z self, 2025-05-07T20:33:14.2556799Z T: int, 2025-05-07T20:33:14.2557008Z D: int, 2025-05-07T20:33:14.2557235Z scale_ub: Optional[float], 2025-05-07T20:33:14.2557503Z contiguous: bool, 2025-05-07T20:33:14.2557746Z compiled: bool, 2025-05-07T20:33:14.2557975Z ) -> None: 2025-05-07T20:33:14.2558194Z torch.manual_seed(2025) 2025-05-07T20:33:14.2558437Z 2025-05-07T20:33:14.2558714Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.2559059Z 2025-05-07T20:33:14.2559249Z x_sign = torch.sign(x) 2025-05-07T20:33:14.2559537Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.2559851Z x = x_sign * x_clamp 2025-05-07T20:33:14.2560087Z x0 = x[:, :D] 2025-05-07T20:33:14.2560307Z x1 = x[:, D:] 2025-05-07T20:33:14.2560518Z 2025-05-07T20:33:14.2560702Z if contiguous: 2025-05-07T20:33:14.2560936Z x0 = x0.contiguous() 2025-05-07T20:33:14.2561202Z x1 = x1.contiguous() 2025-05-07T20:33:14.2561530Z 2025-05-07T20:33:14.2561728Z if scale_ub is not None: 2025-05-07T20:33:14.2562006Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.2562338Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.2562693Z ) 2025-05-07T20:33:14.2562885Z else: 2025-05-07T20:33:14.2563110Z scale_ub_tensor = None 2025-05-07T20:33:14.2563364Z 2025-05-07T20:33:14.2563598Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.2563913Z op = silu_mul_quant 2025-05-07T20:33:14.2564169Z if compiled: 2025-05-07T20:33:14.2564422Z op = torch.compile(op) 2025-05-07T20:33:14.2564723Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.2564999Z 2025-05-07T20:33:14.2565197Z y_fp8, y_scale = fn() 2025-05-07T20:33:14.2565789Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:14.2566082Z 2025-05-07T20:33:14.2566329Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.2566751Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:14.2567043Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:14.2567362Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:14.2567724Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:14.2568038Z 2025-05-07T20:33:14.2568250Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:14.2568446Z 2025-05-07T20:33:14.2568555Z moe/activation_test.py:126: 2025-05-07T20:33:14.2568857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.2569188Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:14.2569514Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:14.2570297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:14.2571046Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:14.2571597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.2572280Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.2572966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:14.2573679Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:14.2574410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:14.2575054Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:14.2575658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:14.2576171Z fn() 2025-05-07T20:33:14.2576683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:14.2577264Z self.fn.run( 2025-05-07T20:33:14.2577724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.2578257Z kernel = self.compile( 2025-05-07T20:33:14.2578795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.2579447Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.2579843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.2580077Z 2025-05-07T20:33:14.2580284Z self = 2025-05-07T20:33:14.2581441Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.2582870Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f31a96c0>} 2025-05-07T20:33:14.2584258Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.2585281Z context = 2025-05-07T20:33:14.2585576Z 2025-05-07T20:33:14.2585745Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.2586272Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.2586740Z module_map=module_map) 2025-05-07T20:33:14.2587113Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.2587486Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:14.2587802Z E ^ 2025-05-07T20:33:14.2588271Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.2588734Z 2025-05-07T20:33:14.2589145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.2589656Z 2025-05-07T20:33:14.2589766Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.2590177Z self=, 2025-05-07T20:33:14.2590581Z T=16384, 2025-05-07T20:33:14.2590779Z D=7168, 2025-05-07T20:33:14.2590976Z scale_ub=1200.0, 2025-05-07T20:33:14.2591210Z contiguous=False, 2025-05-07T20:33:14.2591439Z compiled=False, 2025-05-07T20:33:14.2591641Z ) 2025-05-07T20:33:14.9933414Z self = 2025-05-07T20:33:14.9934372Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:14.9934859Z 2025-05-07T20:33:14.9934982Z @given( 2025-05-07T20:33:14.9935341Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.9935845Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.9936335Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.9936856Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.9937371Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.9937827Z ) 2025-05-07T20:33:14.9938390Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.9939113Z def test_silu_mul_quant( 2025-05-07T20:33:14.9939497Z self, 2025-05-07T20:33:14.9939803Z T: int, 2025-05-07T20:33:14.9940104Z D: int, 2025-05-07T20:33:14.9940444Z scale_ub: Optional[float], 2025-05-07T20:33:14.9940881Z contiguous: bool, 2025-05-07T20:33:14.9941271Z compiled: bool, 2025-05-07T20:33:14.9941631Z ) -> None: 2025-05-07T20:33:14.9941985Z torch.manual_seed(2025) 2025-05-07T20:33:14.9942393Z 2025-05-07T20:33:14.9942832Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.9943394Z 2025-05-07T20:33:14.9943696Z x_sign = torch.sign(x) 2025-05-07T20:33:14.9944153Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.9944673Z x = x_sign * x_clamp 2025-05-07T20:33:14.9945076Z x0 = x[:, :D] 2025-05-07T20:33:14.9945416Z x1 = x[:, D:] 2025-05-07T20:33:14.9945754Z 2025-05-07T20:33:14.9946065Z if contiguous: 2025-05-07T20:33:14.9946430Z x0 = x0.contiguous() 2025-05-07T20:33:14.9946844Z x1 = x1.contiguous() 2025-05-07T20:33:14.9947231Z 2025-05-07T20:33:14.9947526Z if scale_ub is not None: 2025-05-07T20:33:14.9948111Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.9949164Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.9949670Z ) 2025-05-07T20:33:14.9949968Z else: 2025-05-07T20:33:14.9950295Z scale_ub_tensor = None 2025-05-07T20:33:14.9950814Z 2025-05-07T20:33:14.9951193Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.9951700Z op = silu_mul_quant 2025-05-07T20:33:14.9952110Z if compiled: 2025-05-07T20:33:14.9952508Z op = torch.compile(op) 2025-05-07T20:33:14.9952954Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.9953332Z 2025-05-07T20:33:14.9953596Z > y_fp8, y_scale = fn() 2025-05-07T20:33:14.9953841Z 2025-05-07T20:33:14.9953993Z moe/activation_test.py:117: 2025-05-07T20:33:14.9954570Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.9955202Z moe/activation_test.py:115: in fn 2025-05-07T20:33:14.9955668Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.9957084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:14.9958279Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:14.9959209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.9960301Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.9961397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.9962231Z kernel = self.compile( 2025-05-07T20:33:14.9963139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.9964243Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.9964940Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.9969588Z 2025-05-07T20:33:14.9969833Z self = 2025-05-07T20:33:14.9970974Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.9972378Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f2058040>} 2025-05-07T20:33:14.9973729Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.9974762Z context = 2025-05-07T20:33:14.9975052Z 2025-05-07T20:33:14.9975228Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.9975765Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.9976253Z module_map=module_map) 2025-05-07T20:33:14.9976620Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.9976991Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:14.9977263Z E ^ 2025-05-07T20:33:14.9977736Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.9978190Z 2025-05-07T20:33:14.9978607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.9979125Z 2025-05-07T20:33:14.9979232Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.9979651Z self=, 2025-05-07T20:33:14.9980236Z T=1, 2025-05-07T20:33:14.9980426Z D=7168, 2025-05-07T20:33:14.9980632Z scale_ub=None, 2025-05-07T20:33:14.9980856Z contiguous=True, 2025-05-07T20:33:14.9981081Z compiled=True, 2025-05-07T20:33:14.9981366Z ) 2025-05-07T20:33:14.9981696Z self = 2025-05-07T20:33:14.9982180Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:14.9982445Z 2025-05-07T20:33:14.9982525Z @given( 2025-05-07T20:33:14.9982764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.9983087Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.9983407Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.9983754Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.9984093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.9984387Z ) 2025-05-07T20:33:14.9984748Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.9985203Z def test_silu_mul_quant( 2025-05-07T20:33:14.9985507Z self, 2025-05-07T20:33:14.9985713Z T: int, 2025-05-07T20:33:14.9985918Z D: int, 2025-05-07T20:33:14.9986140Z scale_ub: Optional[float], 2025-05-07T20:33:14.9986424Z contiguous: bool, 2025-05-07T20:33:14.9986670Z compiled: bool, 2025-05-07T20:33:14.9986894Z ) -> None: 2025-05-07T20:33:14.9987119Z torch.manual_seed(2025) 2025-05-07T20:33:14.9987369Z 2025-05-07T20:33:14.9987642Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.9987997Z 2025-05-07T20:33:14.9988198Z x_sign = torch.sign(x) 2025-05-07T20:33:14.9988490Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.9988814Z x = x_sign * x_clamp 2025-05-07T20:33:14.9989061Z x0 = x[:, :D] 2025-05-07T20:33:14.9989285Z x1 = x[:, D:] 2025-05-07T20:33:14.9989496Z 2025-05-07T20:33:14.9989697Z if contiguous: 2025-05-07T20:33:14.9989937Z x0 = x0.contiguous() 2025-05-07T20:33:14.9990197Z x1 = x1.contiguous() 2025-05-07T20:33:14.9990444Z 2025-05-07T20:33:14.9990644Z if scale_ub is not None: 2025-05-07T20:33:14.9990919Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.9991267Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.9991588Z ) 2025-05-07T20:33:14.9991789Z else: 2025-05-07T20:33:14.9992009Z scale_ub_tensor = None 2025-05-07T20:33:14.9992271Z 2025-05-07T20:33:14.9992508Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.9992832Z op = silu_mul_quant 2025-05-07T20:33:14.9993094Z if compiled: 2025-05-07T20:33:14.9999842Z op = torch.compile(op) 2025-05-07T20:33:15.0000198Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.0000500Z 2025-05-07T20:33:15.0000711Z y_fp8, y_scale = fn() 2025-05-07T20:33:15.0001020Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:15.0001334Z 2025-05-07T20:33:15.0001596Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.0001946Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:15.0002259Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:15.0002592Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:15.0002960Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:15.0003286Z 2025-05-07T20:33:15.0003503Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:15.0003706Z 2025-05-07T20:33:15.0003820Z moe/activation_test.py:126: 2025-05-07T20:33:15.0004140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.0004493Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:15.0004836Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:15.0005782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:15.0006579Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:15.0007271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.0008083Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.0008916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:15.0009789Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:15.0010674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:15.0011434Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:15.0012206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:15.0012741Z fn() 2025-05-07T20:33:15.0013266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:15.0013858Z self.fn.run( 2025-05-07T20:33:15.0014342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.0014889Z kernel = self.compile( 2025-05-07T20:33:15.0015435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.0016106Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.0016523Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.0016762Z 2025-05-07T20:33:15.0016982Z self = 2025-05-07T20:33:15.0018081Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.0019471Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f2058ea0>} 2025-05-07T20:33:15.0020830Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.0021866Z context = 2025-05-07T20:33:15.0022159Z 2025-05-07T20:33:15.0022341Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.0022872Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.0023362Z module_map=module_map) 2025-05-07T20:33:15.0023743Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.0024113Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:15.0024395Z E ^ 2025-05-07T20:33:15.0024872Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.0025332Z 2025-05-07T20:33:15.0025783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.0026325Z 2025-05-07T20:33:15.0026434Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.0026862Z self=, 2025-05-07T20:33:15.0027278Z T=4096, 2025-05-07T20:33:15.0027476Z D=5120, 2025-05-07T20:33:15.0027683Z scale_ub=None, 2025-05-07T20:33:15.0027915Z contiguous=False, 2025-05-07T20:33:15.0028236Z compiled=False, 2025-05-07T20:33:15.0028459Z ) 2025-05-07T20:33:15.7989301Z self = 2025-05-07T20:33:15.7989985Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:15.7990270Z 2025-05-07T20:33:15.7990372Z @given( 2025-05-07T20:33:15.7990614Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.7990946Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.7991272Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.7991613Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.7991959Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.7992262Z ) 2025-05-07T20:33:15.7992616Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.7993065Z def test_silu_mul_quant( 2025-05-07T20:33:15.7993318Z self, 2025-05-07T20:33:15.7993534Z T: int, 2025-05-07T20:33:15.7993737Z D: int, 2025-05-07T20:33:15.7994068Z scale_ub: Optional[float], 2025-05-07T20:33:15.7994360Z contiguous: bool, 2025-05-07T20:33:15.7994612Z compiled: bool, 2025-05-07T20:33:15.7994851Z ) -> None: 2025-05-07T20:33:15.7995088Z torch.manual_seed(2025) 2025-05-07T20:33:15.7995341Z 2025-05-07T20:33:15.7995626Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.7996062Z 2025-05-07T20:33:15.7996293Z x_sign = torch.sign(x) 2025-05-07T20:33:15.7996597Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.7996928Z x = x_sign * x_clamp 2025-05-07T20:33:15.7997179Z x0 = x[:, :D] 2025-05-07T20:33:15.7997414Z x1 = x[:, D:] 2025-05-07T20:33:15.7997640Z 2025-05-07T20:33:15.7997832Z if contiguous: 2025-05-07T20:33:15.7998075Z x0 = x0.contiguous() 2025-05-07T20:33:15.7998361Z x1 = x1.contiguous() 2025-05-07T20:33:15.7998618Z 2025-05-07T20:33:15.7998833Z if scale_ub is not None: 2025-05-07T20:33:15.7999125Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.7999476Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.7999797Z ) 2025-05-07T20:33:15.8000011Z else: 2025-05-07T20:33:15.8000237Z scale_ub_tensor = None 2025-05-07T20:33:15.8000500Z 2025-05-07T20:33:15.8000749Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.8001083Z op = silu_mul_quant 2025-05-07T20:33:15.8001340Z if compiled: 2025-05-07T20:33:15.8001602Z op = torch.compile(op) 2025-05-07T20:33:15.8001915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.8002203Z 2025-05-07T20:33:15.8002416Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.8002587Z 2025-05-07T20:33:15.8002699Z moe/activation_test.py:117: 2025-05-07T20:33:15.8003007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.8003362Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.8003658Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.8004361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.8005057Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.8005607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.8006304Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.8006985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.8007525Z kernel = self.compile( 2025-05-07T20:33:15.8008075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.8008872Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.8009285Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.8009561Z 2025-05-07T20:33:15.8009770Z self = 2025-05-07T20:33:15.8010863Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.8012251Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f317b240>} 2025-05-07T20:33:15.8013602Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.8014684Z context = 2025-05-07T20:33:15.8014976Z 2025-05-07T20:33:15.8015146Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.8015681Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.8016203Z module_map=module_map) 2025-05-07T20:33:15.8016580Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.8016946Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.8017214Z E ^ 2025-05-07T20:33:15.8017680Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.8018139Z 2025-05-07T20:33:15.8018556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.8019079Z 2025-05-07T20:33:15.8019189Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.8019620Z self=, 2025-05-07T20:33:15.8020026Z T=4096, 2025-05-07T20:33:15.8020231Z D=7168, 2025-05-07T20:33:15.8020434Z scale_ub=None, 2025-05-07T20:33:15.8020655Z contiguous=False, 2025-05-07T20:33:15.8020889Z compiled=False, 2025-05-07T20:33:15.8021102Z ) 2025-05-07T20:33:15.8021430Z self = 2025-05-07T20:33:15.8021928Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:15.8022216Z 2025-05-07T20:33:15.8022299Z @given( 2025-05-07T20:33:15.8022540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.8022860Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.8023179Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.8023520Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.8023858Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.8024160Z ) 2025-05-07T20:33:15.8024519Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.8024971Z def test_silu_mul_quant( 2025-05-07T20:33:15.8025216Z self, 2025-05-07T20:33:15.8025422Z T: int, 2025-05-07T20:33:15.8025629Z D: int, 2025-05-07T20:33:15.8025852Z scale_ub: Optional[float], 2025-05-07T20:33:15.8026163Z contiguous: bool, 2025-05-07T20:33:15.8026437Z compiled: bool, 2025-05-07T20:33:15.8026669Z ) -> None: 2025-05-07T20:33:15.8026897Z torch.manual_seed(2025) 2025-05-07T20:33:15.8027150Z 2025-05-07T20:33:15.8027424Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.8027778Z 2025-05-07T20:33:15.8027980Z x_sign = torch.sign(x) 2025-05-07T20:33:15.8028275Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.8028689Z x = x_sign * x_clamp 2025-05-07T20:33:15.8028941Z x0 = x[:, :D] 2025-05-07T20:33:15.8029165Z x1 = x[:, D:] 2025-05-07T20:33:15.8029384Z 2025-05-07T20:33:15.8029580Z if contiguous: 2025-05-07T20:33:15.8029857Z x0 = x0.contiguous() 2025-05-07T20:33:15.8030126Z x1 = x1.contiguous() 2025-05-07T20:33:15.8030374Z 2025-05-07T20:33:15.8030574Z if scale_ub is not None: 2025-05-07T20:33:15.8030849Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.8031193Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.8031512Z ) 2025-05-07T20:33:15.8031706Z else: 2025-05-07T20:33:15.8031925Z scale_ub_tensor = None 2025-05-07T20:33:15.8032183Z 2025-05-07T20:33:15.8032417Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.8032745Z op = silu_mul_quant 2025-05-07T20:33:15.8033003Z if compiled: 2025-05-07T20:33:15.8033258Z op = torch.compile(op) 2025-05-07T20:33:15.8033610Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.8033896Z 2025-05-07T20:33:15.8034093Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.8034270Z 2025-05-07T20:33:15.8034372Z moe/activation_test.py:117: 2025-05-07T20:33:15.8034687Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.8035025Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.8035315Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.8036085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.8036832Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.8037376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.8038068Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.8038758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.8039292Z kernel = self.compile( 2025-05-07T20:33:15.8039849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.8040514Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.8040925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.8041160Z 2025-05-07T20:33:15.8041370Z self = 2025-05-07T20:33:15.8042464Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.8043855Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f16f25c0>} 2025-05-07T20:33:15.8045209Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.8046253Z context = 2025-05-07T20:33:15.8046545Z 2025-05-07T20:33:15.8046716Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.8047260Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.8047741Z module_map=module_map) 2025-05-07T20:33:15.8048106Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.8048475Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.8048749Z E ^ 2025-05-07T20:33:15.8049318Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.8049778Z 2025-05-07T20:33:15.8050192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.8050753Z 2025-05-07T20:33:15.8050863Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.8051288Z self=, 2025-05-07T20:33:15.8051703Z T=128, 2025-05-07T20:33:15.8051895Z D=7168, 2025-05-07T20:33:15.8052099Z scale_ub=None, 2025-05-07T20:33:15.8052320Z contiguous=False, 2025-05-07T20:33:15.8052546Z compiled=True, 2025-05-07T20:33:15.8052759Z ) 2025-05-07T20:33:15.8608962Z self = 2025-05-07T20:33:15.8609546Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:15.8609900Z 2025-05-07T20:33:15.8609997Z @given( 2025-05-07T20:33:15.8610325Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.8610661Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.8610975Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.8611316Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.8611661Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.8611957Z ) 2025-05-07T20:33:15.8612308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.8612761Z def test_silu_mul_quant( 2025-05-07T20:33:15.8613021Z self, 2025-05-07T20:33:15.8613223Z T: int, 2025-05-07T20:33:15.8613439Z D: int, 2025-05-07T20:33:15.8613672Z scale_ub: Optional[float], 2025-05-07T20:33:15.8613955Z contiguous: bool, 2025-05-07T20:33:15.8614207Z compiled: bool, 2025-05-07T20:33:15.8614444Z ) -> None: 2025-05-07T20:33:15.8614672Z torch.manual_seed(2025) 2025-05-07T20:33:15.8614932Z 2025-05-07T20:33:15.8615223Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.8615570Z 2025-05-07T20:33:15.8615775Z x_sign = torch.sign(x) 2025-05-07T20:33:15.8616080Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.8616407Z x = x_sign * x_clamp 2025-05-07T20:33:15.8616649Z x0 = x[:, :D] 2025-05-07T20:33:15.8616869Z x1 = x[:, D:] 2025-05-07T20:33:15.8617082Z 2025-05-07T20:33:15.8617273Z if contiguous: 2025-05-07T20:33:15.8617514Z x0 = x0.contiguous() 2025-05-07T20:33:15.8617782Z x1 = x1.contiguous() 2025-05-07T20:33:15.8618032Z 2025-05-07T20:33:15.8618230Z if scale_ub is not None: 2025-05-07T20:33:15.8618503Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.8618836Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.8619152Z ) 2025-05-07T20:33:15.8619355Z else: 2025-05-07T20:33:15.8619570Z scale_ub_tensor = None 2025-05-07T20:33:15.8619831Z 2025-05-07T20:33:15.8620068Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.8620384Z op = silu_mul_quant 2025-05-07T20:33:15.8620637Z if compiled: 2025-05-07T20:33:15.8620889Z op = torch.compile(op) 2025-05-07T20:33:15.8621185Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.8621462Z 2025-05-07T20:33:15.8621659Z y_fp8, y_scale = fn() 2025-05-07T20:33:15.8621947Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:15.8622237Z 2025-05-07T20:33:15.8622480Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.8622820Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:15.8623112Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:15.8623430Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:15.8623922Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:15.8624235Z 2025-05-07T20:33:15.8624445Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:15.8624646Z 2025-05-07T20:33:15.8624836Z moe/activation_test.py:126: 2025-05-07T20:33:15.8625137Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.8625472Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:15.8625804Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:15.8626593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:15.8627341Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:15.8627889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.8628571Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.8629313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:15.8630040Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:15.8630780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:15.8631421Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:15.8632033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:15.8632549Z fn() 2025-05-07T20:33:15.8633058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:15.8633644Z self.fn.run( 2025-05-07T20:33:15.8634109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.8634648Z kernel = self.compile( 2025-05-07T20:33:15.8635193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.8635906Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.8636356Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.8636600Z 2025-05-07T20:33:15.8636811Z self = 2025-05-07T20:33:15.8637896Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.8639274Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f16f31a0>} 2025-05-07T20:33:15.8640617Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.8641648Z context = 2025-05-07T20:33:15.8641940Z 2025-05-07T20:33:15.8642109Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.8642633Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.8643102Z module_map=module_map) 2025-05-07T20:33:15.8643469Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.8643833Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:15.8644098Z E ^ 2025-05-07T20:33:15.8644566Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.8645075Z 2025-05-07T20:33:15.8645529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.8646041Z 2025-05-07T20:33:15.8646215Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.8646655Z self=, 2025-05-07T20:33:15.8647062Z T=128, 2025-05-07T20:33:15.8647254Z D=7168, 2025-05-07T20:33:15.8647444Z scale_ub=None, 2025-05-07T20:33:15.8647662Z contiguous=False, 2025-05-07T20:33:15.8647890Z compiled=False, 2025-05-07T20:33:15.8648097Z ) 2025-05-07T20:33:16.0645837Z self = 2025-05-07T20:33:16.0646369Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:16.0646738Z 2025-05-07T20:33:16.0646858Z @given( 2025-05-07T20:33:16.0647200Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.0647570Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.0648037Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.0648386Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.0648733Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.0649025Z ) 2025-05-07T20:33:16.0649388Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.0649847Z def test_silu_mul_quant( 2025-05-07T20:33:16.0650096Z self, 2025-05-07T20:33:16.0650331Z T: int, 2025-05-07T20:33:16.0650531Z D: int, 2025-05-07T20:33:16.0650758Z scale_ub: Optional[float], 2025-05-07T20:33:16.0651039Z contiguous: bool, 2025-05-07T20:33:16.0651292Z compiled: bool, 2025-05-07T20:33:16.0651524Z ) -> None: 2025-05-07T20:33:16.0651757Z torch.manual_seed(2025) 2025-05-07T20:33:16.0652016Z 2025-05-07T20:33:16.0652293Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.0652662Z 2025-05-07T20:33:16.0652871Z x_sign = torch.sign(x) 2025-05-07T20:33:16.0653166Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.0653494Z x = x_sign * x_clamp 2025-05-07T20:33:16.0653753Z x0 = x[:, :D] 2025-05-07T20:33:16.0653975Z x1 = x[:, D:] 2025-05-07T20:33:16.0654199Z 2025-05-07T20:33:16.0654394Z if contiguous: 2025-05-07T20:33:16.0654632Z x0 = x0.contiguous() 2025-05-07T20:33:16.0654902Z x1 = x1.contiguous() 2025-05-07T20:33:16.0655158Z 2025-05-07T20:33:16.0655354Z if scale_ub is not None: 2025-05-07T20:33:16.0655644Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.0655991Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.0656304Z ) 2025-05-07T20:33:16.0656519Z else: 2025-05-07T20:33:16.0656738Z scale_ub_tensor = None 2025-05-07T20:33:16.0657002Z 2025-05-07T20:33:16.0657241Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.0657578Z op = silu_mul_quant 2025-05-07T20:33:16.0657845Z if compiled: 2025-05-07T20:33:16.0658094Z op = torch.compile(op) 2025-05-07T20:33:16.0658400Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.0658682Z 2025-05-07T20:33:16.0658882Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.0659057Z 2025-05-07T20:33:16.0659163Z moe/activation_test.py:117: 2025-05-07T20:33:16.0659467Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.0659805Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.0660095Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.0667063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.0667771Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.0668440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.0669184Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.0669915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.0670462Z kernel = self.compile( 2025-05-07T20:33:16.0671012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.0671663Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.0672068Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.0672302Z 2025-05-07T20:33:16.0672517Z self = 2025-05-07T20:33:16.0673653Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.0675035Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f1a645e0>} 2025-05-07T20:33:16.0676436Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.0677464Z context = 2025-05-07T20:33:16.0677751Z 2025-05-07T20:33:16.0677933Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.0678455Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.0678932Z module_map=module_map) 2025-05-07T20:33:16.0679312Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.0679678Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.0679936Z E ^ 2025-05-07T20:33:16.0680407Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:16.0681290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:16.0681908Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:16.0696623Z > y_fp8, y_scale = fn() fails at moe/activation_test.py:117 -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid]: same CompilationError, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:16.0713060Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:16.4504019Z y_fp8, y_scale = fn() returns, then > y_fp8_ref, y_scale_ref = ref_fn() fails at moe/activation_test.py:126 -> triton_quantize_fp8_row (fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid]: same CompilationError
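Every example above fails for the same underlying reason: fp8e4nv is Triton's name for FP8 E4M3 (torch.float8_e4m3fn), and lowering it requires an NVIDIA GPU with compute capability 8.9 or newer (Ada/Hopper); older architectures only expose fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal guard along these lines (a sketch only, not part of the test file; _supports_fp8e4nv is an invented name) would make the suite skip rather than error on such GPUs:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (FP8 E4M3, torch.float8_e4m3fn) needs SM 8.9+ (Ada/Hopper);
        # earlier GPUs only get fp8e4b15/fp8e5 in Triton, matching the
        # ValueError repeated throughout this log.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the test shown above, e.g.:
    # @unittest.skipUnless(_supports_fp8e4nv(), "FP8 E4M3 (fp8e4nv) requires SM 8.9+")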
2025-05-07T20:33:16.4525732Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:16.8182550Z > y_fp8_ref, y_scale_ref = ref_fn() fails at moe/activation_test.py:126 -> _kernel_quantize_fp8_row[grid]: same CompilationError
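Note that the reference path, triton_quantize_fp8_row, is itself a Triton kernel (_kernel_quantize_fp8_row), so both the fused op and its reference hit the same compile error. For intuition only, a rough eager-mode sketch of rowwise FP8 quantization follows; it assumes scale_ub caps the per-row max (consistent with how the test passes scale_ub_tensor) and uses the dequantization identity the test itself applies, y = y_fp8.to(torch.float32) * y_scale[:, None]. The exact clamping and epsilon details of fbgemm's kernel may differ, and quantize_fp8_row_eager is a hypothetical name:

    import torch

    def quantize_fp8_row_eager(y, scale_ub=None):
        # Rowwise FP8 quantization such that y ~= y_fp8.to(torch.float32) * scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max      # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)      # assumed: scale_ub caps the row max
        scale = torch.clamp(row_max, min=1e-12) / fp8_max   # per-row dequantization scale
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

The cast to torch.float8_e4m3fn is a plain dtype conversion in PyTorch, so this eager sketch runs even on GPUs where Triton refuses to emit fp8e4nv.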
2025-05-07T20:33:16.8204093Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:17.2453281Z > y_fp8_ref, y_scale_ref = ref_fn() fails at moe/activation_test.py:126 -> _kernel_quantize_fp8_row[grid]: same CompilationError
2025-05-07T20:33:17.2480435Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:17.6752012Z > y_fp8_ref, y_scale_ref = ref_fn() fails at moe/activation_test.py:126 -> _kernel_quantize_fp8_row[grid]: same CompilationError
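The failure does not depend on hypothesis or the test harness at all; a minimal standalone reproduction (a sketch, with the import path inferred from the traceback and the shape taken from one of the tried examples):

    import torch
    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

    y = torch.randn(128, 5120, device="cuda", dtype=torch.float32)
    # On a GPU without FP8 E4M3 support this raises the same CompilationError:
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    y_fp8, y_scale = triton_quantize_fp8_row(y, None)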
2025-05-07T20:33:17.6774165Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:17.7032333Z W0507 20:33:17.701000 96512 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:33:17.7033669Z W0507 20:33:17.701000 96512 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:33:17.7035106Z W0507 20:33:17.701000 96512 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:33:17.7036179Z W0507 20:33:17.701000 96512 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:33:17.7037285Z W0507 20:33:17.701000 96512 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
2025-05-07T20:33:17.7940176Z > y_fp8_ref, y_scale_ref = ref_fn() fails at moe/activation_test.py:126 -> _kernel_quantize_fp8_row[grid]: same CompilationError
2025-05-07T20:33:17.7961979Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:17.9397173Z > y_fp8, y_scale = fn() fails at moe/activation_test.py:117, via torch/_dynamo/eval_frame.py:678 (_fn) -> _fbgemm_silu_mul_quant[grid]: same CompilationError
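The recompile_limit warning above is a separate issue from the FP8 error: the sweep varies T (the size of dim 0) and flips x0/x1 between contiguous copies and strided views of x (stride 5120 vs 10240, per the guard message), so torch.compile re-traces silu_mul_quant until it hits the default limit of 8 and then falls back to eager. Two common remedies for shape-sweeping tests, sketched with the import path inferred from the traceback:

    import torch
    import torch._dynamo
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x = torch.randn(T, 2 * D, device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

    # Option 1: declare dim 0 (T) dynamic before the first compiled call, so new
    # values of T reuse the same graph. The contiguous/strided switch may still
    # cost one extra recompile, since it changes the stride guard cited above.
    torch._dynamo.mark_dynamic(x0, 0)
    torch._dynamo.mark_dynamic(x1, 0)
    op = torch.compile(silu_mul_quant)
    y_fp8, y_scale = op(x0, x1, None)  # still raises the FP8 CompilationError on this GPU

    # Option 2: raise the knob the warning names (default 8) for parameter sweeps.
    torch._dynamo.config.recompile_limit = 64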
2025-05-07T20:33:17.9414488Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:18.1579169Z > y_fp8_ref, y_scale_ref = ref_fn() fails at moe/activation_test.py:126 -> _kernel_quantize_fp8_row[grid]: same CompilationError
2025-05-07T20:33:18.1601040Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:18.3125056Z > y_fp8, y_scale = fn() fails at moe/activation_test.py:117 -> _fbgemm_silu_mul_quant[grid]:
2025-05-07T20:33:18.3138893Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:18.3139245Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:18.3139502Z E   ^
2025-05-07T20:33:18.3139963Z E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.3140411Z 2025-05-07T20:33:18.3140822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.3141337Z 2025-05-07T20:33:18.3141441Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.3141851Z self=, 2025-05-07T20:33:18.3142251Z T=128, 2025-05-07T20:33:18.3142442Z D=5120, 2025-05-07T20:33:18.3142643Z scale_ub=None, 2025-05-07T20:33:18.3142862Z contiguous=False, 2025-05-07T20:33:18.3143081Z compiled=True, 2025-05-07T20:33:18.3143287Z ) 2025-05-07T20:33:18.3143609Z self = 2025-05-07T20:33:18.3144096Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:18.3144373Z 2025-05-07T20:33:18.3144454Z @given( 2025-05-07T20:33:18.3144687Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.3144999Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.3145303Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.3145629Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.3145957Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.3146236Z ) 2025-05-07T20:33:18.3146586Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.3147037Z def test_silu_mul_quant( 2025-05-07T20:33:18.3147280Z self, 2025-05-07T20:33:18.3147484Z T: int, 2025-05-07T20:33:18.3147732Z D: int, 2025-05-07T20:33:18.3147973Z scale_ub: Optional[float], 2025-05-07T20:33:18.3148246Z contiguous: bool, 2025-05-07T20:33:18.3148490Z compiled: bool, 2025-05-07T20:33:18.3148714Z ) -> None: 2025-05-07T20:33:18.3148924Z torch.manual_seed(2025) 2025-05-07T20:33:18.3149165Z 2025-05-07T20:33:18.3149439Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.3149773Z 2025-05-07T20:33:18.3149966Z x_sign = torch.sign(x) 2025-05-07T20:33:18.3150257Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.3150566Z x = x_sign * x_clamp 2025-05-07T20:33:18.3150800Z x0 = x[:, :D] 2025-05-07T20:33:18.3151017Z x1 = x[:, D:] 2025-05-07T20:33:18.3151229Z 2025-05-07T20:33:18.3151410Z if contiguous: 2025-05-07T20:33:18.3151741Z x0 = x0.contiguous() 2025-05-07T20:33:18.3152002Z x1 = x1.contiguous() 2025-05-07T20:33:18.3152241Z 2025-05-07T20:33:18.3152442Z if scale_ub is not None: 2025-05-07T20:33:18.3152761Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.3153091Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.3153402Z ) 2025-05-07T20:33:18.3153598Z else: 2025-05-07T20:33:18.3153803Z scale_ub_tensor = None 2025-05-07T20:33:18.3154050Z 2025-05-07T20:33:18.3154288Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.3154598Z op = silu_mul_quant 2025-05-07T20:33:18.3154849Z if compiled: 2025-05-07T20:33:18.3155094Z op = torch.compile(op) 2025-05-07T20:33:18.3155395Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.3155665Z 2025-05-07T20:33:18.3155909Z > y_fp8, y_scale = fn() 2025-05-07T20:33:18.3156078Z 2025-05-07T20:33:18.3156180Z moe/activation_test.py:117: 2025-05-07T20:33:18.3156515Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.3156852Z moe/activation_test.py:115: in fn 2025-05-07T20:33:18.3157138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.3157685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:18.3158246Z return fn(*args, **kwargs) 
2025-05-07T20:33:18.3158905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:18.3159588Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:18.3160127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.3160810Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.3161479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.3162009Z kernel = self.compile( 2025-05-07T20:33:18.3162549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.3163204Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.3163598Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.3163827Z 2025-05-07T20:33:18.3164034Z self = 2025-05-07T20:33:18.3165112Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.3166746Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f01a3920>} 2025-05-07T20:33:18.3168082Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.3169098Z context = 2025-05-07T20:33:18.3169393Z 2025-05-07T20:33:18.3169562Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.3170083Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.3170547Z module_map=module_map) 2025-05-07T20:33:18.3170903Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.3171256Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:18.3171512Z E ^ 2025-05-07T20:33:18.3172052Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.3172595Z 2025-05-07T20:33:18.3173006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.3173579Z 2025-05-07T20:33:18.3173684Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.3174093Z self=, 2025-05-07T20:33:18.3174485Z T=128, 2025-05-07T20:33:18.3174677Z D=7168, 2025-05-07T20:33:18.3174869Z scale_ub=1200.0, 2025-05-07T20:33:18.3175085Z contiguous=False, 2025-05-07T20:33:18.3175308Z compiled=False, 2025-05-07T20:33:18.3175514Z ) 2025-05-07T20:33:18.4315049Z self = 2025-05-07T20:33:18.4315891Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:18.4316270Z 2025-05-07T20:33:18.4316388Z @given( 2025-05-07T20:33:18.4316709Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.4317243Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.4317877Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.4318531Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.4319184Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.4319741Z ) 2025-05-07T20:33:18.4320428Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.4321316Z def test_silu_mul_quant( 2025-05-07T20:33:18.4321792Z self, 2025-05-07T20:33:18.4322169Z T: int, 2025-05-07T20:33:18.4322551Z D: int, 2025-05-07T20:33:18.4322979Z scale_ub: Optional[float], 2025-05-07T20:33:18.4323500Z contiguous: bool, 2025-05-07T20:33:18.4323975Z compiled: bool, 2025-05-07T20:33:18.4324420Z ) -> None: 2025-05-07T20:33:18.4325168Z torch.manual_seed(2025) 2025-05-07T20:33:18.4325657Z 2025-05-07T20:33:18.4326196Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.4326870Z 2025-05-07T20:33:18.4327252Z x_sign = torch.sign(x) 2025-05-07T20:33:18.4327595Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.4327924Z x = x_sign * x_clamp 2025-05-07T20:33:18.4328162Z x0 = x[:, :D] 2025-05-07T20:33:18.4328382Z x1 = x[:, D:] 2025-05-07T20:33:18.4328584Z 2025-05-07T20:33:18.4328770Z if contiguous: 2025-05-07T20:33:18.4329001Z x0 = x0.contiguous() 2025-05-07T20:33:18.4329254Z x1 = x1.contiguous() 2025-05-07T20:33:18.4329496Z 2025-05-07T20:33:18.4329690Z if scale_ub is not None: 2025-05-07T20:33:18.4329962Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.4330287Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.4330594Z ) 2025-05-07T20:33:18.4330790Z else: 2025-05-07T20:33:18.4331000Z scale_ub_tensor = None 2025-05-07T20:33:18.4331258Z 2025-05-07T20:33:18.4331490Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.4331796Z op = silu_mul_quant 2025-05-07T20:33:18.4332056Z if compiled: 2025-05-07T20:33:18.4332304Z op = torch.compile(op) 2025-05-07T20:33:18.4332592Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.4332864Z 2025-05-07T20:33:18.4333058Z > y_fp8, y_scale = fn() 2025-05-07T20:33:18.4333220Z 2025-05-07T20:33:18.4333318Z moe/activation_test.py:117: 2025-05-07T20:33:18.4333611Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.4333944Z moe/activation_test.py:115: in fn 2025-05-07T20:33:18.4334227Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.4334984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:18.4335727Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:18.4336264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.4336995Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.4337649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.4338175Z kernel = self.compile( 2025-05-07T20:33:18.4338711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.4339361Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.4339757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.4339984Z 2025-05-07T20:33:18.4340192Z self = 2025-05-07T20:33:18.4341319Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.4342684Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f16f16c0>} 2025-05-07T20:33:18.4344019Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.4345041Z context = 2025-05-07T20:33:18.4345327Z 2025-05-07T20:33:18.4345498Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.4346012Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.4346484Z module_map=module_map) 2025-05-07T20:33:18.4346849Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.4347208Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:18.4347485Z E ^ 2025-05-07T20:33:18.4347985Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.4348435Z 2025-05-07T20:33:18.4348852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.4349361Z 2025-05-07T20:33:18.4349472Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.4349882Z self=, 2025-05-07T20:33:18.4350288Z T=128, 2025-05-07T20:33:18.4350480Z D=5120, 2025-05-07T20:33:18.4350673Z scale_ub=None, 2025-05-07T20:33:18.4350893Z contiguous=False, 2025-05-07T20:33:18.4351125Z compiled=False, 2025-05-07T20:33:18.4351325Z ) 2025-05-07T20:33:18.4351645Z self = 2025-05-07T20:33:18.4352136Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:18.4352407Z 2025-05-07T20:33:18.4352485Z @given( 2025-05-07T20:33:18.4352716Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.4353038Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.4353349Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.4353673Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.4354001Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.4354285Z ) 2025-05-07T20:33:18.4354626Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.4355071Z def test_silu_mul_quant( 2025-05-07T20:33:18.4355312Z self, 2025-05-07T20:33:18.4355597Z T: int, 2025-05-07T20:33:18.4355846Z D: int, 2025-05-07T20:33:18.4356072Z scale_ub: Optional[float], 2025-05-07T20:33:18.4356339Z contiguous: bool, 2025-05-07T20:33:18.4356577Z compiled: bool, 2025-05-07T20:33:18.4356848Z ) -> None: 2025-05-07T20:33:18.4357060Z torch.manual_seed(2025) 2025-05-07T20:33:18.4357307Z 2025-05-07T20:33:18.4357577Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.4357915Z 2025-05-07T20:33:18.4358102Z x_sign = torch.sign(x) 2025-05-07T20:33:18.4358389Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.4358704Z x = x_sign * x_clamp 2025-05-07T20:33:18.4358942Z x0 = x[:, :D] 2025-05-07T20:33:18.4359152Z x1 = x[:, D:] 2025-05-07T20:33:18.4359360Z 2025-05-07T20:33:18.4359547Z if contiguous: 2025-05-07T20:33:18.4359777Z x0 = x0.contiguous() 2025-05-07T20:33:18.4360038Z x1 = x1.contiguous() 2025-05-07T20:33:18.4360279Z 2025-05-07T20:33:18.4360491Z if scale_ub is not None: 2025-05-07T20:33:18.4360811Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.4361151Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.4361458Z ) 2025-05-07T20:33:18.4361655Z else: 2025-05-07T20:33:18.4361860Z scale_ub_tensor = None 2025-05-07T20:33:18.4362111Z 2025-05-07T20:33:18.4362346Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.4362657Z op = silu_mul_quant 2025-05-07T20:33:18.4362903Z if compiled: 2025-05-07T20:33:18.4363147Z op = torch.compile(op) 2025-05-07T20:33:18.4363445Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.4363715Z 2025-05-07T20:33:18.4363907Z > y_fp8, y_scale = fn() 2025-05-07T20:33:18.4364068Z 2025-05-07T20:33:18.4364167Z moe/activation_test.py:117: 2025-05-07T20:33:18.4364461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.4364793Z moe/activation_test.py:115: in fn 2025-05-07T20:33:18.4365073Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.4366059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:18.4366878Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:18.4367499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.4368308Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.4369090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.4369705Z kernel = self.compile( 2025-05-07T20:33:18.4370330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.4371108Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.4371557Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.4371826Z 2025-05-07T20:33:18.4372058Z self = 2025-05-07T20:33:18.4373371Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.4375067Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cfd3fba0>} 2025-05-07T20:33:18.4376799Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.4377871Z context = 2025-05-07T20:33:18.4378161Z 2025-05-07T20:33:18.4378326Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.4378912Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.4379383Z module_map=module_map) 2025-05-07T20:33:18.4379741Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.4380092Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:18.4380348Z E ^ 2025-05-07T20:33:18.4380801Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.4381256Z 2025-05-07T20:33:18.4381666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.4382176Z 2025-05-07T20:33:18.4382285Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.4382761Z self=, 2025-05-07T20:33:18.4383166Z T=128, 2025-05-07T20:33:18.4383358Z D=5120, 2025-05-07T20:33:18.4383550Z scale_ub=1200.0, 2025-05-07T20:33:18.4383771Z contiguous=True, 2025-05-07T20:33:18.4383989Z compiled=False, 2025-05-07T20:33:18.4384326Z ) 2025-05-07T20:33:18.6111289Z self = 2025-05-07T20:33:18.6112005Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:18.6112403Z 2025-05-07T20:33:18.6112515Z @given( 2025-05-07T20:33:18.6112838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.6113277Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.6113610Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.6113948Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.6114293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.6114579Z ) 2025-05-07T20:33:18.6114943Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.6115403Z def test_silu_mul_quant( 2025-05-07T20:33:18.6115648Z self, 2025-05-07T20:33:18.6115936Z T: int, 2025-05-07T20:33:18.6116146Z D: int, 2025-05-07T20:33:18.6116369Z scale_ub: Optional[float], 2025-05-07T20:33:18.6116638Z contiguous: bool, 2025-05-07T20:33:18.6116879Z compiled: bool, 2025-05-07T20:33:18.6117113Z ) -> None: 2025-05-07T20:33:18.6117328Z torch.manual_seed(2025) 2025-05-07T20:33:18.6117571Z 2025-05-07T20:33:18.6117848Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.6118194Z 2025-05-07T20:33:18.6118389Z x_sign = torch.sign(x) 2025-05-07T20:33:18.6118682Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.6119003Z x = x_sign * x_clamp 2025-05-07T20:33:18.6119252Z x0 = x[:, :D] 2025-05-07T20:33:18.6119474Z x1 = x[:, D:] 2025-05-07T20:33:18.6119680Z 2025-05-07T20:33:18.6119865Z if contiguous: 2025-05-07T20:33:18.6120106Z x0 = x0.contiguous() 2025-05-07T20:33:18.6120359Z x1 = x1.contiguous() 2025-05-07T20:33:18.6120603Z 2025-05-07T20:33:18.6120798Z if scale_ub is not None: 2025-05-07T20:33:18.6121067Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.6121400Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.6121716Z ) 2025-05-07T20:33:18.6121912Z else: 2025-05-07T20:33:18.6122123Z scale_ub_tensor = None 2025-05-07T20:33:18.6122377Z 2025-05-07T20:33:18.6122614Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.6122927Z op = silu_mul_quant 2025-05-07T20:33:18.6123174Z if compiled: 2025-05-07T20:33:18.6123539Z op = torch.compile(op) 2025-05-07T20:33:18.6123898Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.6124172Z 2025-05-07T20:33:18.6124368Z > y_fp8, y_scale = fn() 2025-05-07T20:33:18.6124593Z 2025-05-07T20:33:18.6124691Z moe/activation_test.py:117: 2025-05-07T20:33:18.6124997Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.6125328Z moe/activation_test.py:115: in fn 2025-05-07T20:33:18.6125614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.6126299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:18.6126987Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:18.6127529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.6128253Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.6128985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.6129521Z kernel = self.compile( 2025-05-07T20:33:18.6130063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.6130711Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.6131117Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.6131348Z 2025-05-07T20:33:18.6131560Z self = 2025-05-07T20:33:18.6132640Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.6134020Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f078cb80>} 2025-05-07T20:33:18.6135365Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.6136399Z context = 2025-05-07T20:33:18.6136688Z 2025-05-07T20:33:18.6136862Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.6137380Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.6137850Z module_map=module_map) 2025-05-07T20:33:18.6138221Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.6138577Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:18.6138834Z E ^ 2025-05-07T20:33:18.6139308Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.6139761Z 2025-05-07T20:33:18.6140184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.6140697Z 2025-05-07T20:33:18.6140800Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6141216Z self=, 2025-05-07T20:33:18.6141621Z T=1, 2025-05-07T20:33:18.6141809Z D=7168, 2025-05-07T20:33:18.6142000Z scale_ub=1200.0, 2025-05-07T20:33:18.6142225Z contiguous=True, 2025-05-07T20:33:18.6142448Z compiled=True, 2025-05-07T20:33:18.6142652Z ) 2025-05-07T20:33:18.6142973Z self = 2025-05-07T20:33:18.6143467Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:18.6143725Z 2025-05-07T20:33:18.6143902Z @given( 2025-05-07T20:33:18.6144136Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.6144456Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.6144763Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.6145142Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.6145476Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.6145762Z ) 2025-05-07T20:33:18.6146103Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.6146547Z def test_silu_mul_quant( 2025-05-07T20:33:18.6146788Z self, 2025-05-07T20:33:18.6146988Z T: int, 2025-05-07T20:33:18.6147197Z D: int, 2025-05-07T20:33:18.6147419Z scale_ub: Optional[float], 2025-05-07T20:33:18.6147717Z contiguous: bool, 2025-05-07T20:33:18.6147980Z compiled: bool, 2025-05-07T20:33:18.6148207Z ) -> None: 2025-05-07T20:33:18.6148417Z torch.manual_seed(2025) 2025-05-07T20:33:18.6148673Z 2025-05-07T20:33:18.6149023Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.6149367Z 2025-05-07T20:33:18.6149561Z x_sign = torch.sign(x) 2025-05-07T20:33:18.6149852Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.6150166Z x = x_sign * x_clamp 2025-05-07T20:33:18.6150405Z x0 = x[:, :D] 2025-05-07T20:33:18.6150619Z x1 = x[:, D:] 2025-05-07T20:33:18.6150835Z 2025-05-07T20:33:18.6151022Z if contiguous: 2025-05-07T20:33:18.6151248Z x0 = x0.contiguous() 2025-05-07T20:33:18.6151510Z x1 = x1.contiguous() 2025-05-07T20:33:18.6151748Z 2025-05-07T20:33:18.6151933Z if scale_ub is not None: 2025-05-07T20:33:18.6152203Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.6152539Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.6152844Z ) 2025-05-07T20:33:18.6153033Z else: 2025-05-07T20:33:18.6153252Z scale_ub_tensor = None 2025-05-07T20:33:18.6153497Z 2025-05-07T20:33:18.6153748Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.6154069Z op = silu_mul_quant 2025-05-07T20:33:18.6154315Z if compiled: 2025-05-07T20:33:18.6154559Z op = torch.compile(op) 2025-05-07T20:33:18.6154854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.6155128Z 2025-05-07T20:33:18.6155311Z > y_fp8, y_scale = fn() 2025-05-07T20:33:18.6155477Z 2025-05-07T20:33:18.6155579Z moe/activation_test.py:117: 2025-05-07T20:33:18.6155931Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.6156261Z moe/activation_test.py:115: in fn 2025-05-07T20:33:18.6156534Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.6157088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:18.6157650Z return fn(*args, **kwargs) 
2025-05-07T20:33:18.6158303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:18.6158994Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:18.6159531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.6160213Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.6160878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.6161413Z kernel = self.compile( 2025-05-07T20:33:18.6161960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.6162609Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.6163060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.6163334Z 2025-05-07T20:33:18.6163546Z self = 2025-05-07T20:33:18.6164678Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.6166405Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f078e2a0>} 2025-05-07T20:33:18.6168057Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.6169306Z context = 2025-05-07T20:33:18.6169655Z 2025-05-07T20:33:18.6169926Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.6170447Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.6170910Z module_map=module_map) 2025-05-07T20:33:18.6171273Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.6171628Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:18.6171886Z E ^ 2025-05-07T20:33:18.6172353Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.6172806Z 2025-05-07T20:33:18.6173219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.6173728Z 2025-05-07T20:33:18.6173838Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6174244Z self=, 2025-05-07T20:33:18.6174650Z T=1, 2025-05-07T20:33:18.6174834Z D=7168, 2025-05-07T20:33:18.6175022Z scale_ub=1200.0, 2025-05-07T20:33:18.6175245Z contiguous=False, 2025-05-07T20:33:18.6175467Z compiled=True, 2025-05-07T20:33:18.6175664Z ) 2025-05-07T20:33:18.7493156Z self = 2025-05-07T20:33:18.7493905Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:18.7494268Z 2025-05-07T20:33:18.7494380Z @given( 2025-05-07T20:33:18.7494690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.7495096Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.7495493Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.7495896Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.7496221Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.7496505Z ) 2025-05-07T20:33:18.7496860Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.7497301Z def test_silu_mul_quant( 2025-05-07T20:33:18.7497551Z self, 2025-05-07T20:33:18.7497752Z T: int, 2025-05-07T20:33:18.7497953Z D: int, 2025-05-07T20:33:18.7498179Z scale_ub: Optional[float], 2025-05-07T20:33:18.7498454Z contiguous: bool, 2025-05-07T20:33:18.7498693Z compiled: bool, 2025-05-07T20:33:18.7498923Z ) -> None: 2025-05-07T20:33:18.7499145Z torch.manual_seed(2025) 2025-05-07T20:33:18.7499383Z 2025-05-07T20:33:18.7499659Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.7500009Z 2025-05-07T20:33:18.7500204Z x_sign = torch.sign(x) 2025-05-07T20:33:18.7500493Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.7500808Z x = x_sign * x_clamp 2025-05-07T20:33:18.7501053Z x0 = x[:, :D] 2025-05-07T20:33:18.7501267Z x1 = x[:, D:] 2025-05-07T20:33:18.7501655Z 2025-05-07T20:33:18.7501844Z if contiguous: 2025-05-07T20:33:18.7502074Z x0 = x0.contiguous() 2025-05-07T20:33:18.7502331Z x1 = x1.contiguous() 2025-05-07T20:33:18.7502633Z 2025-05-07T20:33:18.7502826Z if scale_ub is not None: 2025-05-07T20:33:18.7503098Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.7503440Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.7503748Z ) 2025-05-07T20:33:18.7503947Z else: 2025-05-07T20:33:18.7504161Z scale_ub_tensor = None 2025-05-07T20:33:18.7504410Z 2025-05-07T20:33:18.7504646Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.7504962Z op = silu_mul_quant 2025-05-07T20:33:18.7505221Z if compiled: 2025-05-07T20:33:18.7505464Z op = torch.compile(op) 2025-05-07T20:33:18.7505760Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.7506043Z 2025-05-07T20:33:18.7506236Z > y_fp8, y_scale = fn() 2025-05-07T20:33:18.7506405Z 2025-05-07T20:33:18.7506571Z moe/activation_test.py:117: 2025-05-07T20:33:18.7506871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.7507206Z moe/activation_test.py:115: in fn 2025-05-07T20:33:18.7507485Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.7508042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:18.7508601Z return fn(*args, **kwargs) 
2025-05-07T20:33:18.7509251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:18.7509939Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:18.7510476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.7511152Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.7511817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.7512351Z kernel = self.compile( 2025-05-07T20:33:18.7512897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.7513550Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.7513952Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.7514181Z 2025-05-07T20:33:18.7514392Z self = 2025-05-07T20:33:18.7515474Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.7516947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f078f9c0>} 2025-05-07T20:33:18.7518289Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.7519314Z context = 2025-05-07T20:33:18.7519603Z 2025-05-07T20:33:18.7519772Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.7520291Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.7520763Z module_map=module_map) 2025-05-07T20:33:18.7521128Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.7521486Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:18.7521835Z E ^ 2025-05-07T20:33:18.7522309Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.7522760Z 2025-05-07T20:33:18.7523221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.7523731Z 2025-05-07T20:33:18.7523838Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.7524260Z self=, 2025-05-07T20:33:18.7524664Z T=1, 2025-05-07T20:33:18.7524849Z D=7168, 2025-05-07T20:33:18.7525040Z scale_ub=None, 2025-05-07T20:33:18.7525257Z contiguous=False, 2025-05-07T20:33:18.7525486Z compiled=True, 2025-05-07T20:33:18.7525684Z ) 2025-05-07T20:33:18.8391496Z self = 2025-05-07T20:33:18.8392206Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:18.8392621Z 2025-05-07T20:33:18.8392749Z @given( 2025-05-07T20:33:18.8393216Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.8393658Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.8394090Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.8394467Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.8394793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.8395078Z ) 2025-05-07T20:33:18.8395437Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.8395946Z def test_silu_mul_quant( 2025-05-07T20:33:18.8396192Z self, 2025-05-07T20:33:18.8396389Z T: int, 2025-05-07T20:33:18.8396581Z D: int, 2025-05-07T20:33:18.8396799Z scale_ub: Optional[float], 2025-05-07T20:33:18.8397070Z contiguous: bool, 2025-05-07T20:33:18.8397310Z compiled: bool, 2025-05-07T20:33:18.8397537Z ) -> None: 2025-05-07T20:33:18.8397765Z torch.manual_seed(2025) 2025-05-07T20:33:18.8398047Z 2025-05-07T20:33:18.8398329Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.8398674Z 2025-05-07T20:33:18.8398864Z x_sign = torch.sign(x) 2025-05-07T20:33:18.8399158Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.8399472Z x = x_sign * x_clamp 2025-05-07T20:33:18.8399717Z x0 = x[:, :D] 2025-05-07T20:33:18.8399930Z x1 = x[:, D:] 2025-05-07T20:33:18.8400139Z 2025-05-07T20:33:18.8400327Z if contiguous: 2025-05-07T20:33:18.8400555Z x0 = x0.contiguous() 2025-05-07T20:33:18.8400817Z x1 = x1.contiguous() 2025-05-07T20:33:18.8401058Z 2025-05-07T20:33:18.8401253Z if scale_ub is not None: 2025-05-07T20:33:18.8401527Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.8401867Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.8402184Z ) 2025-05-07T20:33:18.8402379Z else: 2025-05-07T20:33:18.8402594Z scale_ub_tensor = None 2025-05-07T20:33:18.8402850Z 2025-05-07T20:33:18.8403090Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.8403407Z op = silu_mul_quant 2025-05-07T20:33:18.8403654Z if compiled: 2025-05-07T20:33:18.8403905Z op = torch.compile(op) 2025-05-07T20:33:18.8404199Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.8404477Z 2025-05-07T20:33:18.8404667Z y_fp8, y_scale = fn() 2025-05-07T20:33:18.8404954Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:18.8405251Z 2025-05-07T20:33:18.8405486Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.8405822Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:18.8406116Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:18.8406518Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:18.8406938Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:18.8407252Z 2025-05-07T20:33:18.8407454Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:18.8407769Z 2025-05-07T20:33:18.8407894Z moe/activation_test.py:126: 2025-05-07T20:33:18.8408192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.8408531Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:18.8408855Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:18.8409643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:18.8410392Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:18.8410928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.8411612Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.8412342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:18.8413065Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:18.8413794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:18.8414430Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:18.8415037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:18.8415558Z fn() 2025-05-07T20:33:18.8416061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:18.8416646Z self.fn.run( 2025-05-07T20:33:18.8417123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.8417671Z kernel = self.compile( 2025-05-07T20:33:18.8418213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.8418869Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.8419270Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.8419502Z 2025-05-07T20:33:18.8419711Z self = 2025-05-07T20:33:18.8420793Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.8422175Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cff5cb80>} 2025-05-07T20:33:18.8423519Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.8424543Z context = 2025-05-07T20:33:18.8424832Z 2025-05-07T20:33:18.8424999Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.8425520Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.8425994Z module_map=module_map) 2025-05-07T20:33:18.8426356Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.8426715Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:18.8426983Z E ^ 2025-05-07T20:33:18.8427509Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.8428045Z 2025-05-07T20:33:18.8428463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.8429021Z 2025-05-07T20:33:18.8429126Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.8429537Z self=, 2025-05-07T20:33:18.8429941Z T=1, 2025-05-07T20:33:18.8430126Z D=5120, 2025-05-07T20:33:18.8430328Z scale_ub=1200.0, 2025-05-07T20:33:18.8430553Z contiguous=False, 2025-05-07T20:33:18.8430774Z compiled=True, 2025-05-07T20:33:18.8430983Z ) 2025-05-07T20:33:18.9989365Z self = 2025-05-07T20:33:18.9990069Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:18.9990440Z 2025-05-07T20:33:18.9990551Z @given( 2025-05-07T20:33:18.9990863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.9991253Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.9991674Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.9992012Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.9992348Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.9992634Z ) 2025-05-07T20:33:18.9992978Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.9993421Z def test_silu_mul_quant( 2025-05-07T20:33:18.9993669Z self, 2025-05-07T20:33:18.9993861Z T: int, 2025-05-07T20:33:18.9994064Z D: int, 2025-05-07T20:33:18.9994282Z scale_ub: Optional[float], 2025-05-07T20:33:18.9994549Z contiguous: bool, 2025-05-07T20:33:18.9994796Z compiled: bool, 2025-05-07T20:33:18.9995019Z ) -> None: 2025-05-07T20:33:18.9995233Z torch.manual_seed(2025) 2025-05-07T20:33:18.9995483Z 2025-05-07T20:33:18.9995850Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.9996195Z 2025-05-07T20:33:19.0002620Z x_sign = torch.sign(x) 2025-05-07T20:33:19.0003024Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.0003357Z x = x_sign * x_clamp 2025-05-07T20:33:19.0003599Z x0 = x[:, :D] 2025-05-07T20:33:19.0003829Z x1 = x[:, D:] 2025-05-07T20:33:19.0004041Z 2025-05-07T20:33:19.0004230Z if contiguous: 2025-05-07T20:33:19.0004463Z x0 = x0.contiguous() 2025-05-07T20:33:19.0004726Z x1 = x1.contiguous() 2025-05-07T20:33:19.0004969Z 2025-05-07T20:33:19.0005167Z if scale_ub is not None: 2025-05-07T20:33:19.0005445Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.0005776Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.0006093Z ) 2025-05-07T20:33:19.0006297Z else: 2025-05-07T20:33:19.0006512Z scale_ub_tensor = None 2025-05-07T20:33:19.0006765Z 2025-05-07T20:33:19.0007007Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.0007332Z op = silu_mul_quant 2025-05-07T20:33:19.0007582Z if compiled: 2025-05-07T20:33:19.0007834Z op = torch.compile(op) 2025-05-07T20:33:19.0008138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.0008415Z 2025-05-07T20:33:19.0008611Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.0008775Z 2025-05-07T20:33:19.0008882Z moe/activation_test.py:117: 2025-05-07T20:33:19.0009176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.0009510Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.0009795Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.0010355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.0010918Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.0011702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.0012456Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.0012988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.0013736Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.0014410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.0014941Z kernel = self.compile( 2025-05-07T20:33:19.0015481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.0016137Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.0016537Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.0016771Z 2025-05-07T20:33:19.0016996Z self = 2025-05-07T20:33:19.0018122Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.0019503Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cff5de40>} 2025-05-07T20:33:19.0020850Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.0021881Z context = 2025-05-07T20:33:19.0022173Z 2025-05-07T20:33:19.0022346Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.0022880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.0023361Z module_map=module_map) 2025-05-07T20:33:19.0023737Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.0024096Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.0024360Z E ^ 2025-05-07T20:33:19.0024831Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.0025282Z 2025-05-07T20:33:19.0025697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.0026218Z 2025-05-07T20:33:19.0026326Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.0026747Z self=, 2025-05-07T20:33:19.0027161Z T=1, 2025-05-07T20:33:19.0027345Z D=5120, 2025-05-07T20:33:19.0027553Z scale_ub=1200.0, 2025-05-07T20:33:19.0027784Z contiguous=False, 2025-05-07T20:33:19.0028009Z compiled=False, 2025-05-07T20:33:19.0028223Z ) 2025-05-07T20:33:19.0028552Z self = 2025-05-07T20:33:19.0029045Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.0029317Z 2025-05-07T20:33:19.0029399Z @given( 2025-05-07T20:33:19.0029635Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.0029951Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.0030261Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.0030590Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.0030923Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.0031205Z ) 2025-05-07T20:33:19.0031561Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.0032101Z def test_silu_mul_quant( 2025-05-07T20:33:19.0032348Z self, 2025-05-07T20:33:19.0032549Z T: int, 2025-05-07T20:33:19.0032749Z D: int, 2025-05-07T20:33:19.0032965Z scale_ub: Optional[float], 2025-05-07T20:33:19.0033282Z contiguous: bool, 2025-05-07T20:33:19.0033547Z compiled: bool, 2025-05-07T20:33:19.0033777Z ) -> None: 2025-05-07T20:33:19.0033996Z torch.manual_seed(2025) 2025-05-07T20:33:19.0034245Z 2025-05-07T20:33:19.0034520Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.0034860Z 2025-05-07T20:33:19.0035056Z x_sign = torch.sign(x) 2025-05-07T20:33:19.0035346Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.0035667Z x = x_sign * x_clamp 2025-05-07T20:33:19.0035967Z x0 = x[:, :D] 2025-05-07T20:33:19.0036193Z x1 = x[:, D:] 2025-05-07T20:33:19.0036413Z 2025-05-07T20:33:19.0036598Z if contiguous: 2025-05-07T20:33:19.0036844Z x0 = x0.contiguous() 2025-05-07T20:33:19.0037156Z x1 = x1.contiguous() 2025-05-07T20:33:19.0037398Z 2025-05-07T20:33:19.0037595Z if scale_ub is not None: 2025-05-07T20:33:19.0037900Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.0038260Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.0038576Z ) 2025-05-07T20:33:19.0038774Z else: 2025-05-07T20:33:19.0038988Z scale_ub_tensor = None 2025-05-07T20:33:19.0039240Z 2025-05-07T20:33:19.0039471Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.0039793Z op = silu_mul_quant 2025-05-07T20:33:19.0040052Z if compiled: 2025-05-07T20:33:19.0040305Z op = torch.compile(op) 2025-05-07T20:33:19.0040607Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.0040884Z 2025-05-07T20:33:19.0041086Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.0041253Z 2025-05-07T20:33:19.0041367Z moe/activation_test.py:117: 2025-05-07T20:33:19.0041657Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.0041994Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.0042281Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.0042961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.0043647Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <stripped>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <stripped>, 'min_dot_size': <stripped>}
module_map = {'triton.language.extra.libdevice': <stripped>}
context = <stripped>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<stripped>,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)

self = <stripped>
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(remaining frames and locals identical to the traceback above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
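The root cause is architectural, not data-dependent: fp8e4nv is Triton's E4M3 FP8 type, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada/Hopper) upward. This job runs on a g5.4xlarge (A10G, SM 8.6), where Triton exposes only fp8e4b15 and fp8e5, so every compile of _fbgemm_silu_mul_quant (the op evidently fuses SiLU(x0) * x1 with quantization to FP8, returning y_fp8 and its scale) fails before any kernel launches. A capability gate of roughly the following shape would skip these cases instead of failing them; this is a sketch with a hypothetical helper, not code present in moe/activation_test.py:

    # Sketch only: skip FP8-E4M3 tests on pre-SM-8.9 GPUs (hypothetical helper).
    import pytest
    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (E4M3) needs SM 8.9+; the A10G on this runner
        # reports capability (8, 6), so kernel compilation fails there.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @pytest.mark.skipif(not _supports_fp8e4nv(), reason="fp8e4nv requires SM 8.9+")
    def test_silu_mul_quant() -> None:
        ...  # the Hypothesis test body shown above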
Hypothesis went on to try eleven more examples. Each failed at the same point, src.make_ir in triton/compiler/compiler.py:273, with the identical CompilationError raised at compiler.py:100 ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). Only the drawn parameters differ; examples with compiled=True additionally pass through torch/_dynamo/eval_frame.py:678:

Trying example: T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False
Trying example: T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True
Trying example: T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True
Trying example: T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False
Trying example: T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.0880056Z 2025-05-07T20:33:20.0880469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.0880984Z 2025-05-07T20:33:20.0881087Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.0881500Z self=, 2025-05-07T20:33:20.0881898Z T=2048, 2025-05-07T20:33:20.0882092Z D=7168, 2025-05-07T20:33:20.0882285Z scale_ub=None, 2025-05-07T20:33:20.0882501Z contiguous=False, 2025-05-07T20:33:20.0882733Z compiled=True, 2025-05-07T20:33:20.0882937Z ) 2025-05-07T20:33:20.1751459Z self = 2025-05-07T20:33:20.1752014Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:20.1752298Z 2025-05-07T20:33:20.1752389Z @given( 2025-05-07T20:33:20.1752633Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.1752952Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.1753249Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.1753582Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.1753914Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.1754195Z ) 2025-05-07T20:33:20.1754545Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.1754989Z def test_silu_mul_quant( 2025-05-07T20:33:20.1755229Z self, 2025-05-07T20:33:20.1755434Z T: int, 2025-05-07T20:33:20.1755639Z D: int, 2025-05-07T20:33:20.1755940Z scale_ub: Optional[float], 2025-05-07T20:33:20.1756211Z contiguous: bool, 2025-05-07T20:33:20.1756446Z compiled: bool, 2025-05-07T20:33:20.1756670Z ) -> None: 2025-05-07T20:33:20.1756886Z torch.manual_seed(2025) 2025-05-07T20:33:20.1757306Z 2025-05-07T20:33:20.1757589Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.1757931Z 2025-05-07T20:33:20.1758127Z x_sign = torch.sign(x) 2025-05-07T20:33:20.1758476Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:20.1758782Z x = x_sign * x_clamp 2025-05-07T20:33:20.1759027Z x0 = x[:, :D] 2025-05-07T20:33:20.1759247Z x1 = x[:, D:] 2025-05-07T20:33:20.1759453Z 2025-05-07T20:33:20.1759652Z if contiguous: 2025-05-07T20:33:20.1759891Z x0 = x0.contiguous() 2025-05-07T20:33:20.1760148Z x1 = x1.contiguous() 2025-05-07T20:33:20.1760393Z 2025-05-07T20:33:20.1760593Z if scale_ub is not None: 2025-05-07T20:33:20.1760874Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:20.1761206Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:20.1761526Z ) 2025-05-07T20:33:20.1761731Z else: 2025-05-07T20:33:20.1761947Z scale_ub_tensor = None 2025-05-07T20:33:20.1762273Z 2025-05-07T20:33:20.1762518Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.1762836Z op = silu_mul_quant 2025-05-07T20:33:20.1763093Z if compiled: 2025-05-07T20:33:20.1763348Z op = torch.compile(op) 2025-05-07T20:33:20.1763637Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.1763916Z 2025-05-07T20:33:20.1764112Z > y_fp8, y_scale = fn() 2025-05-07T20:33:20.1764275Z 2025-05-07T20:33:20.1764375Z moe/activation_test.py:117: 2025-05-07T20:33:20.1764673Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.1765004Z moe/activation_test.py:115: in fn 2025-05-07T20:33:20.1765280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.1766252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:20.1766827Z return fn(*args, **kwargs) 
2025-05-07T20:33:20.1767484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:20.1768184Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:20.1768725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:20.1769416Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:20.1770084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:20.1770618Z kernel = self.compile( 2025-05-07T20:33:20.1771175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:20.1771915Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:20.1772320Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.1772611Z 2025-05-07T20:33:20.1772848Z self = 2025-05-07T20:33:20.1774040Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:20.1775463Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf558720>} 2025-05-07T20:33:20.1776806Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:20.1777824Z context = 2025-05-07T20:33:20.1778273Z 2025-05-07T20:33:20.1778443Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:20.1778963Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:20.1779488Z module_map=module_map) 2025-05-07T20:33:20.1779845Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:20.1780198Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:20.1780458Z E ^ 2025-05-07T20:33:20.1780913Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.1781366Z 2025-05-07T20:33:20.1781778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.1782290Z 2025-05-07T20:33:20.1782392Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.1782809Z self=, 2025-05-07T20:33:20.1783204Z T=4096, 2025-05-07T20:33:20.1783453Z D=7168, 2025-05-07T20:33:20.1783654Z scale_ub=None, 2025-05-07T20:33:20.1783869Z contiguous=False, 2025-05-07T20:33:20.1784096Z compiled=True, 2025-05-07T20:33:20.1784304Z ) 2025-05-07T20:33:20.1784625Z self = 2025-05-07T20:33:20.1785121Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:20.1785394Z 2025-05-07T20:33:20.1785479Z @given( 2025-05-07T20:33:20.1785706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.1786025Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.1786331Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.1786663Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.1786997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.1787280Z ) 2025-05-07T20:33:20.1787626Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.1788072Z def test_silu_mul_quant( 2025-05-07T20:33:20.1788316Z self, 2025-05-07T20:33:20.1788510Z T: int, 2025-05-07T20:33:20.1788709Z D: int, 2025-05-07T20:33:20.1788931Z scale_ub: Optional[float], 2025-05-07T20:33:20.1789196Z contiguous: bool, 2025-05-07T20:33:20.1789436Z compiled: bool, 2025-05-07T20:33:20.1789663Z ) -> None: 2025-05-07T20:33:20.1789874Z torch.manual_seed(2025) 2025-05-07T20:33:20.1790114Z 2025-05-07T20:33:20.1790387Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.1790723Z 2025-05-07T20:33:20.1790915Z x_sign = torch.sign(x) 2025-05-07T20:33:20.1791200Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:20.1791504Z x = x_sign * x_clamp 2025-05-07T20:33:20.1791758Z x0 = x[:, :D] 2025-05-07T20:33:20.1791972Z x1 = x[:, D:] 2025-05-07T20:33:20.1792183Z 2025-05-07T20:33:20.1792362Z if contiguous: 2025-05-07T20:33:20.1792597Z x0 = x0.contiguous() 2025-05-07T20:33:20.1792858Z x1 = x1.contiguous() 2025-05-07T20:33:20.1793097Z 2025-05-07T20:33:20.1793286Z if scale_ub is not None: 2025-05-07T20:33:20.1793556Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:20.1793883Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:20.1794189Z ) 2025-05-07T20:33:20.1794385Z else: 2025-05-07T20:33:20.1794588Z scale_ub_tensor = None 2025-05-07T20:33:20.1794845Z 2025-05-07T20:33:20.1795076Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.1795393Z op = silu_mul_quant 2025-05-07T20:33:20.1795640Z if compiled: 2025-05-07T20:33:20.1795943Z op = torch.compile(op) 2025-05-07T20:33:20.1796244Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.1796572Z 2025-05-07T20:33:20.1796803Z > y_fp8, y_scale = fn() 2025-05-07T20:33:20.1796969Z 2025-05-07T20:33:20.1797074Z moe/activation_test.py:117: 2025-05-07T20:33:20.1797361Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.1797730Z moe/activation_test.py:115: in fn 2025-05-07T20:33:20.1798014Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.1798559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:20.1799119Z return fn(*args, **kwargs) 
2025-05-07T20:33:20.1799774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:20.1800463Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:20.1800995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:20.1801679Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:20.1802382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:20.1802920Z kernel = self.compile( 2025-05-07T20:33:20.1803453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:20.1804111Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:20.1804509Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.1804736Z 2025-05-07T20:33:20.1804944Z self = 2025-05-07T20:33:20.1806017Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:20.1807395Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf559440>} 2025-05-07T20:33:20.1808789Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:20.1809815Z context = 2025-05-07T20:33:20.1810104Z 2025-05-07T20:33:20.1810267Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:20.1810785Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:20.1811254Z module_map=module_map) 2025-05-07T20:33:20.1811626Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:20.1811980Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:20.1812246Z E ^ 2025-05-07T20:33:20.1812708Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.1813157Z 2025-05-07T20:33:20.1813566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.1814076Z 2025-05-07T20:33:20.3408741Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.3409372Z self=, 2025-05-07T20:33:20.3409949Z T=16384, 2025-05-07T20:33:20.3410263Z D=5120, 2025-05-07T20:33:20.3410540Z scale_ub=1200.0, 2025-05-07T20:33:20.3410849Z contiguous=False, 2025-05-07T20:33:20.3411105Z compiled=False, 2025-05-07T20:33:20.3411319Z ) 2025-05-07T20:33:20.3411649Z self = 2025-05-07T20:33:20.3412302Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:20.3412658Z 2025-05-07T20:33:20.3412739Z @given( 2025-05-07T20:33:20.3412980Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.3413302Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.3413673Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.3414009Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.3414342Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.3414633Z ) 2025-05-07T20:33:20.3414986Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.3415433Z def test_silu_mul_quant( 2025-05-07T20:33:20.3415673Z self, 2025-05-07T20:33:20.3415877Z T: int, 2025-05-07T20:33:20.3416080Z D: int, 2025-05-07T20:33:20.3416295Z scale_ub: Optional[float], 2025-05-07T20:33:20.3416572Z contiguous: bool, 2025-05-07T20:33:20.3416815Z compiled: bool, 2025-05-07T20:33:20.3417054Z ) -> None: 2025-05-07T20:33:20.3417269Z torch.manual_seed(2025) 2025-05-07T20:33:20.3417581Z 2025-05-07T20:33:20.3417859Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.3418206Z 2025-05-07T20:33:20.3418420Z x_sign = torch.sign(x) 2025-05-07T20:33:20.3418753Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:20.3419062Z x = x_sign * x_clamp 2025-05-07T20:33:20.3419309Z x0 = x[:, :D] 2025-05-07T20:33:20.3419538Z x1 = x[:, D:] 2025-05-07T20:33:20.3419743Z 2025-05-07T20:33:20.3419932Z if contiguous: 2025-05-07T20:33:20.3420169Z x0 = x0.contiguous() 2025-05-07T20:33:20.3420429Z x1 = x1.contiguous() 2025-05-07T20:33:20.3420669Z 2025-05-07T20:33:20.3420868Z if scale_ub is not None: 2025-05-07T20:33:20.3421137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:20.3421475Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:20.3421797Z ) 2025-05-07T20:33:20.3421993Z else: 2025-05-07T20:33:20.3422207Z scale_ub_tensor = None 2025-05-07T20:33:20.3422467Z 2025-05-07T20:33:20.3422702Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.3423019Z op = silu_mul_quant 2025-05-07T20:33:20.3423275Z if compiled: 2025-05-07T20:33:20.3423524Z op = torch.compile(op) 2025-05-07T20:33:20.3423820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.3424106Z 2025-05-07T20:33:20.3424301Z > y_fp8, y_scale = fn() 2025-05-07T20:33:20.3424464Z 2025-05-07T20:33:20.3424566Z moe/activation_test.py:117: 2025-05-07T20:33:20.3424865Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.3425207Z moe/activation_test.py:115: in fn 2025-05-07T20:33:20.3425602Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.3426305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:20.3427005Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:20.3427548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:20.3428230Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:20.3428943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:20.3429481Z kernel = self.compile( 2025-05-07T20:33:20.3430026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:20.3430682Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:20.3431083Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.3431315Z 2025-05-07T20:33:20.3431588Z self = 2025-05-07T20:33:20.3432951Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:20.3434369Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf55a340>} 2025-05-07T20:33:20.3435786Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:20.3436843Z context = 2025-05-07T20:33:20.3437133Z 2025-05-07T20:33:20.3437305Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:20.3437877Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:20.3438352Z module_map=module_map) 2025-05-07T20:33:20.3438720Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:20.3439077Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:20.3439333Z E ^ 2025-05-07T20:33:20.3439802Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.3440255Z 2025-05-07T20:33:20.3440675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.3441185Z 2025-05-07T20:33:20.3441288Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.3441701Z self=, 2025-05-07T20:33:20.3442104Z T=16384, 2025-05-07T20:33:20.3442301Z D=5120, 2025-05-07T20:33:20.3442497Z scale_ub=1200.0, 2025-05-07T20:33:20.3442719Z contiguous=True, 2025-05-07T20:33:20.3442944Z compiled=True, 2025-05-07T20:33:20.3443141Z ) 2025-05-07T20:33:20.3443458Z self = 2025-05-07T20:33:20.3443952Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:20.3444227Z 2025-05-07T20:33:20.3444306Z @given( 2025-05-07T20:33:20.3444537Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.3444853Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.3445157Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.3445488Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.3445820Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.3446110Z ) 2025-05-07T20:33:20.3446455Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.3446898Z def test_silu_mul_quant( 2025-05-07T20:33:20.3447140Z self, 2025-05-07T20:33:20.3447334Z T: int, 2025-05-07T20:33:20.3447534Z D: int, 2025-05-07T20:33:20.3447752Z scale_ub: Optional[float], 2025-05-07T20:33:20.3448024Z contiguous: bool, 2025-05-07T20:33:20.3448265Z compiled: bool, 2025-05-07T20:33:20.3448489Z ) -> None: 2025-05-07T20:33:20.3448704Z torch.manual_seed(2025) 2025-05-07T20:33:20.3448950Z 2025-05-07T20:33:20.3449229Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.3449571Z 2025-05-07T20:33:20.3449771Z x_sign = torch.sign(x) 2025-05-07T20:33:20.3450063Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:20.3450378Z x = x_sign * x_clamp 2025-05-07T20:33:20.3450615Z x0 = x[:, :D] 2025-05-07T20:33:20.3450831Z x1 = x[:, D:] 2025-05-07T20:33:20.3451042Z 2025-05-07T20:33:20.3451223Z if contiguous: 2025-05-07T20:33:20.3451511Z x0 = x0.contiguous() 2025-05-07T20:33:20.3451806Z x1 = x1.contiguous() 2025-05-07T20:33:20.3452050Z 2025-05-07T20:33:20.3452244Z if scale_ub is not None: 2025-05-07T20:33:20.3452518Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:20.3452889Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:20.3453198Z ) 2025-05-07T20:33:20.3453395Z else: 2025-05-07T20:33:20.3453608Z scale_ub_tensor = None 2025-05-07T20:33:20.3453863Z 2025-05-07T20:33:20.3454096Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.3454408Z op = silu_mul_quant 2025-05-07T20:33:20.3454659Z if compiled: 2025-05-07T20:33:20.3454911Z op = torch.compile(op) 2025-05-07T20:33:20.3455205Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.3455491Z 2025-05-07T20:33:20.3455695Z > y_fp8, y_scale = fn() 2025-05-07T20:33:20.3455857Z 2025-05-07T20:33:20.3455967Z moe/activation_test.py:117: 2025-05-07T20:33:20.3456301Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.3456639Z moe/activation_test.py:115: in fn 2025-05-07T20:33:20.3456934Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.3457490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:20.3458053Z return fn(*args, **kwargs) 
2025-05-07T20:33:20.3458713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:20.3459400Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:20.3459936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:20.3460616Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:20.3461286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:20.3461820Z kernel = self.compile( 2025-05-07T20:33:20.3462359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:20.3463020Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:20.3463419Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.3463654Z 2025-05-07T20:33:20.3463862Z self = 2025-05-07T20:33:20.3464949Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:20.3466663Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf55b9c0>} 2025-05-07T20:33:20.3468025Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:20.3469061Z context = 2025-05-07T20:33:20.3469349Z 2025-05-07T20:33:20.3469518Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:20.3470046Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:20.3470518Z module_map=module_map) 2025-05-07T20:33:20.3470884Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:20.3471242Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:20.3471504Z E ^ 2025-05-07T20:33:20.3472059Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.3472605Z 2025-05-07T20:33:20.3473111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.3473845Z 2025-05-07T20:33:20.5181147Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.5181822Z self=, 2025-05-07T20:33:20.5182395Z T=16384, 2025-05-07T20:33:20.5182660Z D=5120, 2025-05-07T20:33:20.5182929Z scale_ub=None, 2025-05-07T20:33:20.5183152Z contiguous=False, 2025-05-07T20:33:20.5183382Z compiled=True, 2025-05-07T20:33:20.5183595Z ) 2025-05-07T20:33:20.5183920Z self = 2025-05-07T20:33:20.5184425Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:20.5184751Z 2025-05-07T20:33:20.5184831Z @given( 2025-05-07T20:33:20.5185090Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.5185565Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.5185874Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.5186214Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.5186548Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.5186836Z ) 2025-05-07T20:33:20.5187187Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.5187632Z def test_silu_mul_quant( 2025-05-07T20:33:20.5187874Z self, 2025-05-07T20:33:20.5188078Z T: int, 2025-05-07T20:33:20.5188278Z D: int, 2025-05-07T20:33:20.5188507Z scale_ub: Optional[float], 2025-05-07T20:33:20.5188829Z contiguous: bool, 2025-05-07T20:33:20.5189069Z compiled: bool, 2025-05-07T20:33:20.5189297Z ) -> None: 2025-05-07T20:33:20.5189512Z torch.manual_seed(2025) 2025-05-07T20:33:20.5189764Z 2025-05-07T20:33:20.5190052Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.5190397Z 2025-05-07T20:33:20.5190602Z x_sign = torch.sign(x) 2025-05-07T20:33:20.5190902Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:20.5197432Z x = x_sign * x_clamp 2025-05-07T20:33:20.5197692Z x0 = x[:, :D] 2025-05-07T20:33:20.5197915Z x1 = x[:, D:] 2025-05-07T20:33:20.5198129Z 2025-05-07T20:33:20.5198325Z if contiguous: 2025-05-07T20:33:20.5198554Z x0 = x0.contiguous() 2025-05-07T20:33:20.5198817Z x1 = x1.contiguous() 2025-05-07T20:33:20.5199062Z 2025-05-07T20:33:20.5199254Z if scale_ub is not None: 2025-05-07T20:33:20.5199529Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:20.5199875Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:20.5200186Z ) 2025-05-07T20:33:20.5200379Z else: 2025-05-07T20:33:20.5200596Z scale_ub_tensor = None 2025-05-07T20:33:20.5200850Z 2025-05-07T20:33:20.5201097Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.5201417Z op = silu_mul_quant 2025-05-07T20:33:20.5201681Z if compiled: 2025-05-07T20:33:20.5201925Z op = torch.compile(op) 2025-05-07T20:33:20.5202230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.5202508Z 2025-05-07T20:33:20.5202697Z > y_fp8, y_scale = fn() 2025-05-07T20:33:20.5202870Z 2025-05-07T20:33:20.5202971Z moe/activation_test.py:117: 2025-05-07T20:33:20.5203286Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.5203624Z moe/activation_test.py:115: in fn 2025-05-07T20:33:20.5203903Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.5204467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:20.5205027Z return fn(*args, **kwargs) 
2025-05-07T20:33:20.5205851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:20.5206542Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:20.5207141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:20.5207821Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:20.5208494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:20.5209066Z kernel = self.compile( 2025-05-07T20:33:20.5209605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:20.5210251Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:20.5210651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.5210890Z 2025-05-07T20:33:20.5211137Z self = 2025-05-07T20:33:20.5212221Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:20.5213603Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf368c20>} 2025-05-07T20:33:20.5214936Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:20.5215967Z context = 2025-05-07T20:33:20.5216263Z 2025-05-07T20:33:20.5216433Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:20.5216964Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:20.5217435Z module_map=module_map) 2025-05-07T20:33:20.5217804Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:20.5218158Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:20.5218419Z E ^ 2025-05-07T20:33:20.5218935Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.5219389Z 2025-05-07T20:33:20.5219801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.5220309Z 2025-05-07T20:33:20.5220421Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.5220837Z self=, 2025-05-07T20:33:20.5221251Z T=2048, 2025-05-07T20:33:20.5221448Z D=5120, 2025-05-07T20:33:20.5221644Z scale_ub=None, 2025-05-07T20:33:20.5221865Z contiguous=False, 2025-05-07T20:33:20.5222092Z compiled=True, 2025-05-07T20:33:20.5222292Z ) 2025-05-07T20:33:20.6126825Z self = 2025-05-07T20:33:20.6127604Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:20.6128012Z 2025-05-07T20:33:20.6128132Z @given( 2025-05-07T20:33:20.6128461Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.6129341Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.6129975Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.6130642Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.6131300Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.6131881Z ) 2025-05-07T20:33:20.6132820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.6133817Z def test_silu_mul_quant( 2025-05-07T20:33:20.6134311Z self, 2025-05-07T20:33:20.6134708Z T: int, 2025-05-07T20:33:20.6135103Z D: int, 2025-05-07T20:33:20.6135661Z scale_ub: Optional[float], 2025-05-07T20:33:20.6136207Z contiguous: bool, 2025-05-07T20:33:20.6136685Z compiled: bool, 2025-05-07T20:33:20.6137143Z ) -> None: 2025-05-07T20:33:20.6137585Z torch.manual_seed(2025) 2025-05-07T20:33:20.6138076Z 2025-05-07T20:33:20.6138557Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.6138961Z 2025-05-07T20:33:20.6139158Z x_sign = torch.sign(x) 2025-05-07T20:33:20.6139456Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:20.6139778Z x = x_sign * x_clamp 2025-05-07T20:33:20.6140024Z x0 = x[:, :D] 2025-05-07T20:33:20.6140241Z x1 = x[:, D:] 2025-05-07T20:33:20.6140454Z 2025-05-07T20:33:20.6140662Z if contiguous: 2025-05-07T20:33:20.6140905Z x0 = x0.contiguous() 2025-05-07T20:33:20.6141234Z x1 = x1.contiguous() 2025-05-07T20:33:20.6141483Z 2025-05-07T20:33:20.6141685Z if scale_ub is not None: 2025-05-07T20:33:20.6141974Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:20.6142329Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:20.6142654Z ) 2025-05-07T20:33:20.6142857Z else: 2025-05-07T20:33:20.6143082Z scale_ub_tensor = None 2025-05-07T20:33:20.6143336Z 2025-05-07T20:33:20.6143572Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.6143896Z op = silu_mul_quant 2025-05-07T20:33:20.6144152Z if compiled: 2025-05-07T20:33:20.6144408Z op = torch.compile(op) 2025-05-07T20:33:20.6144706Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.6144991Z 2025-05-07T20:33:20.6145197Z > y_fp8, y_scale = fn() 2025-05-07T20:33:20.6145370Z 2025-05-07T20:33:20.6145481Z moe/activation_test.py:117: 2025-05-07T20:33:20.6145788Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.6146134Z moe/activation_test.py:115: in fn 2025-05-07T20:33:20.6146422Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.6146986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:20.6147557Z return fn(*args, **kwargs) 
2025-05-07T20:33:20.6148225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:20.6148918Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:20.6149454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:20.6150139Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:20.6150816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:20.6151349Z kernel = self.compile( 2025-05-07T20:33:20.6151898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:20.6152560Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:20.6152964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.6153197Z 2025-05-07T20:33:20.6153408Z self = 2025-05-07T20:33:20.6154498Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:20.6156005Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf3699e0>} 2025-05-07T20:33:20.6157393Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:20.6158467Z context = 2025-05-07T20:33:20.6158760Z 2025-05-07T20:33:20.6158931Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:20.6159464Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:20.6159969Z module_map=module_map) 2025-05-07T20:33:20.6160345Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:20.6160714Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:20.6160978Z E ^ 2025-05-07T20:33:20.6161504Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.6161960Z 2025-05-07T20:33:20.6162383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.6162899Z 2025-05-07T20:33:20.6163010Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.6163436Z self=, 2025-05-07T20:33:20.6163854Z T=2048, 2025-05-07T20:33:20.6164053Z D=5120, 2025-05-07T20:33:20.6164256Z scale_ub=1200.0, 2025-05-07T20:33:20.6164488Z contiguous=False, 2025-05-07T20:33:20.6164718Z compiled=True, 2025-05-07T20:33:20.6164940Z ) 2025-05-07T20:33:20.6165269Z self = 2025-05-07T20:33:20.6166025Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:20.6166308Z 2025-05-07T20:33:20.6166399Z @given( 2025-05-07T20:33:20.6166647Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.6166970Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.6167284Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.6167620Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.6167958Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.6168245Z ) 2025-05-07T20:33:20.6168628Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.6169113Z def test_silu_mul_quant( 2025-05-07T20:33:20.6169358Z self, 2025-05-07T20:33:20.6169565Z T: int, 2025-05-07T20:33:20.6169768Z D: int, 2025-05-07T20:33:20.6169990Z scale_ub: Optional[float], 2025-05-07T20:33:20.6170265Z contiguous: bool, 2025-05-07T20:33:20.6170518Z compiled: bool, 2025-05-07T20:33:20.6170751Z ) -> None: 2025-05-07T20:33:20.6170971Z torch.manual_seed(2025) 2025-05-07T20:33:20.6171227Z 2025-05-07T20:33:20.6171515Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.6171859Z 2025-05-07T20:33:20.6172059Z x_sign = torch.sign(x) 2025-05-07T20:33:20.6172358Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:20.6172675Z x = x_sign * x_clamp 2025-05-07T20:33:20.6172924Z x0 = x[:, :D] 2025-05-07T20:33:20.6173151Z x1 = x[:, D:] 2025-05-07T20:33:20.6173362Z 2025-05-07T20:33:20.6173560Z if contiguous: 2025-05-07T20:33:20.6173797Z x0 = x0.contiguous() 2025-05-07T20:33:20.6174059Z x1 = x1.contiguous() 2025-05-07T20:33:20.6174308Z 2025-05-07T20:33:20.6174507Z if scale_ub is not None: 2025-05-07T20:33:20.6174781Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:20.6175127Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:20.6175441Z ) 2025-05-07T20:33:20.6175797Z else: 2025-05-07T20:33:20.6176015Z scale_ub_tensor = None 2025-05-07T20:33:20.6176280Z 2025-05-07T20:33:20.6176511Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.6176903Z op = silu_mul_quant 2025-05-07T20:33:20.6177157Z if compiled: 2025-05-07T20:33:20.6177408Z op = torch.compile(op) 2025-05-07T20:33:20.6177707Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.6177989Z 2025-05-07T20:33:20.6178194Z > y_fp8, y_scale = fn() 2025-05-07T20:33:20.6178359Z 2025-05-07T20:33:20.6178458Z moe/activation_test.py:117: 2025-05-07T20:33:20.6178760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.6179103Z moe/activation_test.py:115: in fn 2025-05-07T20:33:20.6179385Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.6179950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:20.6180523Z return fn(*args, **kwargs) 
2025-05-07T20:33:20.6181244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:20.6181936Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:20.6182481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:20.6183168Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:20.6183834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:20.6184366Z kernel = self.compile( 2025-05-07T20:33:20.6184913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:20.6185568Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:20.6185971Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.6186213Z 2025-05-07T20:33:20.6186427Z self = 2025-05-07T20:33:20.6187516Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:20.6188952Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf36ab60>} 2025-05-07T20:33:20.6190293Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:20.6191327Z context = 2025-05-07T20:33:20.6191630Z 2025-05-07T20:33:20.6191800Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:20.6192330Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:20.6192800Z module_map=module_map) 2025-05-07T20:33:20.6193171Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:20.6193539Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:20.6193810Z E ^ 2025-05-07T20:33:20.6194271Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.6194725Z 2025-05-07T20:33:20.6195140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.6195651Z 2025-05-07T20:33:20.7948358Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.7949287Z self=, 2025-05-07T20:33:20.7949890Z T=4096, 2025-05-07T20:33:20.7950085Z D=5120, 2025-05-07T20:33:20.7950285Z scale_ub=1200.0, 2025-05-07T20:33:20.7950510Z contiguous=True, 2025-05-07T20:33:20.7950808Z compiled=True, 2025-05-07T20:33:20.7951013Z ) 2025-05-07T20:33:20.7951334Z self = 2025-05-07T20:33:20.7951831Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:20.7952119Z 2025-05-07T20:33:20.7952198Z @given( 2025-05-07T20:33:20.7952430Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.7952742Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.7953042Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.7953374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.7953702Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.7953989Z ) 2025-05-07T20:33:20.7954342Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.7954842Z def test_silu_mul_quant( 2025-05-07T20:33:20.7955083Z self, 2025-05-07T20:33:20.7955284Z T: int, 2025-05-07T20:33:20.7955479Z D: int, 2025-05-07T20:33:20.7955691Z scale_ub: Optional[float], 2025-05-07T20:33:20.7956029Z contiguous: bool, 2025-05-07T20:33:20.7956272Z compiled: bool, 2025-05-07T20:33:20.7956499Z ) -> None: 2025-05-07T20:33:20.7956711Z torch.manual_seed(2025) 2025-05-07T20:33:20.7956960Z 2025-05-07T20:33:20.7957229Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.7957572Z 2025-05-07T20:33:20.7957768Z x_sign = torch.sign(x) 2025-05-07T20:33:20.7958057Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:20.7958366Z x = x_sign * x_clamp 2025-05-07T20:33:20.7958611Z x0 = x[:, :D] 2025-05-07T20:33:20.7958859Z x1 = x[:, D:] 2025-05-07T20:33:20.7959084Z 2025-05-07T20:33:20.7959274Z if contiguous: 2025-05-07T20:33:20.7959510Z x0 = x0.contiguous() 2025-05-07T20:33:20.7959761Z x1 = x1.contiguous() 2025-05-07T20:33:20.7960004Z 2025-05-07T20:33:20.7960201Z if scale_ub is not None: 2025-05-07T20:33:20.7960469Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:20.7960804Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:20.7961111Z ) 2025-05-07T20:33:20.7961302Z else: 2025-05-07T20:33:20.7961515Z scale_ub_tensor = None 2025-05-07T20:33:20.7961766Z 2025-05-07T20:33:20.7961999Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.7962309Z op = silu_mul_quant 2025-05-07T20:33:20.7962557Z if compiled: 2025-05-07T20:33:20.7962803Z op = torch.compile(op) 2025-05-07T20:33:20.7963091Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.7963370Z 2025-05-07T20:33:20.7963564Z > y_fp8, y_scale = fn() 2025-05-07T20:33:20.7963728Z 2025-05-07T20:33:20.7963828Z moe/activation_test.py:117: 2025-05-07T20:33:20.7964126Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.7964457Z moe/activation_test.py:115: in fn 2025-05-07T20:33:20.7964731Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.7965292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:20.7966111Z return fn(*args, **kwargs) 
2025-05-07T20:33:20.7966767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:20.7967446Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:20.7967978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:20.7968744Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:20.7969499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:20.7970080Z kernel = self.compile( 2025-05-07T20:33:20.7970618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:20.7971265Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:20.7971660Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.7971893Z 2025-05-07T20:33:20.7972101Z self = 2025-05-07T20:33:20.7973191Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:20.7974626Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf4b8180>} 2025-05-07T20:33:20.7975975Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:20.7976992Z context = 2025-05-07T20:33:20.7977282Z 2025-05-07T20:33:20.7977449Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:20.7977976Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:20.7978448Z module_map=module_map) 2025-05-07T20:33:20.7978842Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:20.7979213Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:20.7979475Z E ^ 2025-05-07T20:33:20.7980087Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.7980547Z 2025-05-07T20:33:20.7980958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.7981469Z 2025-05-07T20:33:20.7981573Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.7981990Z self=, 2025-05-07T20:33:20.7982390Z T=128, 2025-05-07T20:33:20.7982577Z D=5120, 2025-05-07T20:33:20.7982769Z scale_ub=1200.0, 2025-05-07T20:33:20.7982987Z contiguous=False, 2025-05-07T20:33:20.7983209Z compiled=True, 2025-05-07T20:33:20.7983418Z ) 2025-05-07T20:33:21.0645583Z self = 2025-05-07T20:33:21.0646372Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:21.0646789Z 2025-05-07T20:33:21.0646916Z @given( 2025-05-07T20:33:21.0647234Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:21.0647672Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:21.0648010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:21.0648349Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:21.0648676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:21.0648973Z ) 2025-05-07T20:33:21.0649331Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:21.0649779Z def test_silu_mul_quant( 2025-05-07T20:33:21.0650032Z self, 2025-05-07T20:33:21.0650232Z T: int, 2025-05-07T20:33:21.0650439Z D: int, 2025-05-07T20:33:21.0650662Z scale_ub: Optional[float], 2025-05-07T20:33:21.0650943Z contiguous: bool, 2025-05-07T20:33:21.0651186Z compiled: bool, 2025-05-07T20:33:21.0651595Z ) -> None: 2025-05-07T20:33:21.0651819Z torch.manual_seed(2025) 2025-05-07T20:33:21.0652070Z 2025-05-07T20:33:21.0652344Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:21.0652757Z 2025-05-07T20:33:21.0652961Z x_sign = torch.sign(x) 2025-05-07T20:33:21.0653256Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:21.0653578Z x = x_sign * x_clamp 2025-05-07T20:33:21.0653827Z x0 = x[:, :D] 2025-05-07T20:33:21.0654045Z x1 = x[:, D:] 2025-05-07T20:33:21.0654259Z 2025-05-07T20:33:21.0654455Z if contiguous: 2025-05-07T20:33:21.0654688Z x0 = x0.contiguous() 2025-05-07T20:33:21.0654957Z x1 = x1.contiguous() 2025-05-07T20:33:21.0655209Z 2025-05-07T20:33:21.0655401Z if scale_ub is not None: 2025-05-07T20:33:21.0655681Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:21.0656028Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:21.0656349Z ) 2025-05-07T20:33:21.0656551Z else: 2025-05-07T20:33:21.0656838Z scale_ub_tensor = None 2025-05-07T20:33:21.0657094Z 2025-05-07T20:33:21.0657339Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:21.0657667Z op = silu_mul_quant 2025-05-07T20:33:21.0657930Z if compiled: 2025-05-07T20:33:21.0658182Z op = torch.compile(op) 2025-05-07T20:33:21.0658494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:21.0658785Z 2025-05-07T20:33:21.0659022Z > y_fp8, y_scale = fn() 2025-05-07T20:33:21.0659200Z 2025-05-07T20:33:21.0659306Z moe/activation_test.py:117: 2025-05-07T20:33:21.0659616Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.0659964Z moe/activation_test.py:115: in fn 2025-05-07T20:33:21.0660250Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:21.0666885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:21.0667499Z return fn(*args, **kwargs) 
2025-05-07T20:33:21.0668172Z [... tail of the previous Hypothesis example: the same silu_mul_quant traceback and fp8e4nv CompilationError that is shown in full below ...]
2025-05-07T20:33:21.0683006Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f32cf4b8ea0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
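Every CompilationError in this run bottoms out in the same ValueError: Triton's NVIDIA backend accepts the fp8e4nv (float8_e4m3fn) element type only on GPUs with compute capability 8.9 or newer (Ada/Hopper), and raises exactly this message on older architectures such as sm_86. A minimal guard sketch for fp8-dependent tests; the helper name, class name, and decorator placement are illustrative, not FBGEMM's actual code:

# Sketch: skip fp8 tests on GPUs where Triton rejects fp8e4nv.
# supports_fp8e4nv() is a hypothetical helper, not part of the FBGEMM suite.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    """True if the current CUDA device can compile fp8e4nv Triton kernels."""
    if not torch.cuda.is_available():
        return False
    # Triton's NVIDIA backend requires compute capability >= 8.9 for
    # fp8e4nv; an A10G, for example, reports (8, 6) and fails here.
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv needs compute capability >= 8.9")
class Fp8SiluMulQuantTests(unittest.TestCase):
    pass  # fp8 test cases would go here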
2025-05-07T20:33:21.1941022Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[... identical test body and CompilationError traceback as above: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") ...]
2025-05-07T20:33:21.1972962Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[... same CompilationError ...]
2025-05-07T20:33:21.3743040Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... same CompilationError ...]
2025-05-07T20:33:21.3778331Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... same CompilationError ...]
2025-05-07T20:33:21.4729829Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
[... same CompilationError ...]
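The examples above vary only the Hypothesis-drawn shapes and flags; the failure itself is shape-independent, so it reproduces without Hypothesis at the smallest size. A repro sketch, assuming the module path shown in the traceback and the op(x0, x1, scale_ub) call signature used by the test:

# Repro sketch for the shape-independent CompilationError; the import
# path is inferred from fbgemm_gpu/experimental/gen_ai/moe/activation.py
# in the traceback above.
import torch

from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

x0 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
x1 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)

# On a GPU older than compute capability 8.9 this raises
# triton.compiler.errors.CompilationError wrapping the fp8e4nv ValueError.
y_fp8, y_scale = silu_mul_quant(x0, x1, None)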
2025-05-07T20:33:21.5444672Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... same test body as above, now failing earlier, at the clamp: ...]
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

2025-05-07T20:33:21.5458510Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... OutOfMemoryError at the same clamp (moe/activation_test.py:95): tried to allocate 112.00 MiB with 28.44 MiB free ...]
2025-05-07T20:33:21.5472194Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
[... OutOfMemoryError already at the initial torch.randn (moe/activation_test.py:92): tried to allocate 448.00 MiB with 140.44 MiB free ...]
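The allocator's request sizes line up exactly with the test's tensor shapes: for the worst case above (T=16384, D=7168) the initial bfloat16 allocation alone is 448 MiB, matching "Tried to allocate 448.00 MiB", and each follow-up temporary (abs, clamp, sign, product) requests a block of the same size. A quick check of the arithmetic:

# Worked check of the 448.00 MiB request for T=16384, D=7168:
# x = torch.randn([T, 2 * D], dtype=torch.bfloat16) needs T * 2D * 2 bytes.
T, D = 16384, 7168
bytes_per_bf16 = 2
x_mib = T * (2 * D) * bytes_per_bf16 / 2**20
print(x_mib)  # 448.0, matching the allocator message above

The same formula gives 320 MiB for (16384, 5120), 112 MiB for (4096, 7168), and 56 MiB for (2048, 7168), matching every OOM report in this run.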
2025-05-07T20:33:21.5484669Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... OutOfMemoryError at the clamp (moe/activation_test.py:95): tried to allocate 56.00 MiB with 28.44 MiB free ...]
2025-05-07T20:33:21.5498014Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
[... OutOfMemoryError one line earlier, at torch.sign (moe/activation_test.py:94): tried to allocate 56.00 MiB with 28.44 MiB free ...]
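These OOMs hit with roughly 22 GiB already resident, i.e. memory accumulated across the preceding Hypothesis examples rather than any single oversized tensor, and the error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True against fragmentation. A mitigation sketch for the test side; release_cuda_memory() is a hypothetical per-example cleanup, not part of the original suite:

# Sketch: release cached CUDA blocks between Hypothesis examples so one
# example's temporaries cannot starve the next.
import gc

import torch


def release_cuda_memory() -> None:
    gc.collect()              # drop dead Python references to tensors first
    torch.cuda.synchronize()  # let in-flight kernels finish
    torch.cuda.empty_cache()  # return cached blocks to the driver

# The allocator hint from the error message is set in the environment
# before the test process starts, e.g.:
#   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True pytest moe/activation_test.py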
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:21.6672969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:21.6673662Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:21.6674431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:21.6674971Z kernel = self.compile( 2025-05-07T20:33:21.6675510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:21.6676227Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:21.6676624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.6676857Z 2025-05-07T20:33:21.6677064Z self = 2025-05-07T20:33:21.6678153Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:21.6679600Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf0aa520>} 2025-05-07T20:33:21.6680942Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:21.6681973Z context = 2025-05-07T20:33:21.6682263Z 2025-05-07T20:33:21.6682433Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:21.6682958Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:21.6683429Z module_map=module_map) 2025-05-07T20:33:21.6683796Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:21.6684152Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:21.6684417Z E ^ 2025-05-07T20:33:21.6684880Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:21.6685329Z 2025-05-07T20:33:21.6685741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:21.6686259Z 2025-05-07T20:33:21.6686363Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:21.6686780Z self=, 2025-05-07T20:33:21.6687189Z T=128, 2025-05-07T20:33:21.6687379Z D=5120, 2025-05-07T20:33:21.6687573Z scale_ub=None, 2025-05-07T20:33:21.6687795Z contiguous=True, 2025-05-07T20:33:21.6688014Z compiled=False, 2025-05-07T20:33:21.6688224Z ) 2025-05-07T20:33:21.7361286Z self = 2025-05-07T20:33:21.7362069Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:21.7362537Z 2025-05-07T20:33:21.7362647Z @given( 2025-05-07T20:33:21.7362961Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:21.7363338Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:21.7363731Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:21.7364058Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:21.7364387Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:21.7364672Z ) 2025-05-07T20:33:21.7365017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:21.7365627Z def test_silu_mul_quant( 2025-05-07T20:33:21.7365866Z self, 2025-05-07T20:33:21.7366055Z T: int, 2025-05-07T20:33:21.7366248Z D: int, 2025-05-07T20:33:21.7366465Z scale_ub: Optional[float], 2025-05-07T20:33:21.7366734Z contiguous: bool, 2025-05-07T20:33:21.7366980Z compiled: bool, 2025-05-07T20:33:21.7367212Z ) -> None: 2025-05-07T20:33:21.7367428Z torch.manual_seed(2025) 2025-05-07T20:33:21.7367741Z 2025-05-07T20:33:21.7368014Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:21.7368367Z 2025-05-07T20:33:21.7368555Z x_sign = torch.sign(x) 2025-05-07T20:33:21.7368845Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:21.7369156Z x = x_sign * x_clamp 2025-05-07T20:33:21.7369426Z x0 = x[:, :D] 2025-05-07T20:33:21.7369661Z x1 = x[:, D:] 2025-05-07T20:33:21.7369874Z 2025-05-07T20:33:21.7370056Z if contiguous: 2025-05-07T20:33:21.7370293Z x0 = x0.contiguous() 2025-05-07T20:33:21.7370556Z x1 = x1.contiguous() 2025-05-07T20:33:21.7370790Z 2025-05-07T20:33:21.7370987Z if scale_ub is not None: 2025-05-07T20:33:21.7371257Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:21.7371587Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:21.7371902Z ) 2025-05-07T20:33:21.7372097Z else: 2025-05-07T20:33:21.7372314Z scale_ub_tensor = None 2025-05-07T20:33:21.7372565Z 2025-05-07T20:33:21.7372799Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:21.7373111Z op = silu_mul_quant 2025-05-07T20:33:21.7373363Z if compiled: 2025-05-07T20:33:21.7373607Z op = torch.compile(op) 2025-05-07T20:33:21.7373900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:21.7374168Z 2025-05-07T20:33:21.7374364Z > y_fp8, y_scale = fn() 2025-05-07T20:33:21.7374526Z 2025-05-07T20:33:21.7374631Z moe/activation_test.py:117: 2025-05-07T20:33:21.7374926Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.7375265Z moe/activation_test.py:115: in fn 2025-05-07T20:33:21.7375542Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:21.7376225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:21.7376919Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:21.7377456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:21.7378136Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:21.7378793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:21.7379326Z kernel = self.compile( 2025-05-07T20:33:21.7379862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:21.7380518Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:21.7380908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.7381142Z 2025-05-07T20:33:21.7381422Z self = 2025-05-07T20:33:21.7382560Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:21.7383986Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf0ab420>} 2025-05-07T20:33:21.7385322Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:21.7386351Z context = 2025-05-07T20:33:21.7386645Z 2025-05-07T20:33:21.7386809Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:21.7387374Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:21.7387840Z module_map=module_map) 2025-05-07T20:33:21.7388210Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:21.7388572Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:21.7388827Z E ^ 2025-05-07T20:33:21.7389300Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:21.7389749Z 2025-05-07T20:33:21.7390158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:21.7390664Z 2025-05-07T20:33:21.7390773Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:21.7391178Z self=, 2025-05-07T20:33:21.7391577Z T=128, 2025-05-07T20:33:21.7391772Z D=7168, 2025-05-07T20:33:21.7391969Z scale_ub=None, 2025-05-07T20:33:21.7392179Z contiguous=True, 2025-05-07T20:33:21.7392410Z compiled=False, 2025-05-07T20:33:21.7392636Z ) 2025-05-07T20:33:21.7392953Z self = 2025-05-07T20:33:21.7393444Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:21.7393720Z 2025-05-07T20:33:21.7393795Z @given( 2025-05-07T20:33:21.7394033Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:21.7394339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:21.7394644Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:21.7394968Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:21.7395295Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:21.7395580Z ) 2025-05-07T20:33:21.7395990Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:21.7396434Z def test_silu_mul_quant( 2025-05-07T20:33:21.7396681Z self, 2025-05-07T20:33:21.7396872Z T: int, 2025-05-07T20:33:21.7397074Z D: int, 2025-05-07T20:33:21.7397289Z scale_ub: Optional[float], 2025-05-07T20:33:21.7397565Z contiguous: bool, 2025-05-07T20:33:21.7397800Z compiled: bool, 2025-05-07T20:33:21.7398018Z ) -> None: 2025-05-07T20:33:21.7398229Z torch.manual_seed(2025) 2025-05-07T20:33:21.7398464Z 2025-05-07T20:33:21.7398733Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:21.7399071Z 2025-05-07T20:33:21.7399280Z x_sign = torch.sign(x) 2025-05-07T20:33:21.7399602Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:21.7399912Z x = x_sign * x_clamp 2025-05-07T20:33:21.7400146Z x0 = x[:, :D] 2025-05-07T20:33:21.7400364Z x1 = x[:, D:] 2025-05-07T20:33:21.7400570Z 2025-05-07T20:33:21.7400748Z if contiguous: 2025-05-07T20:33:21.7401032Z x0 = x0.contiguous() 2025-05-07T20:33:21.7401329Z x1 = x1.contiguous() 2025-05-07T20:33:21.7401566Z 2025-05-07T20:33:21.7401754Z if scale_ub is not None: 2025-05-07T20:33:21.7402022Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:21.7402390Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:21.7402699Z ) 2025-05-07T20:33:21.7402890Z else: 2025-05-07T20:33:21.7403101Z scale_ub_tensor = None 2025-05-07T20:33:21.7403346Z 2025-05-07T20:33:21.7403582Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:21.7403892Z op = silu_mul_quant 2025-05-07T20:33:21.7404138Z if compiled: 2025-05-07T20:33:21.7404379Z op = torch.compile(op) 2025-05-07T20:33:21.7404671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:21.7404935Z 2025-05-07T20:33:21.7405130Z > y_fp8, y_scale = fn() 2025-05-07T20:33:21.7405296Z 2025-05-07T20:33:21.7405408Z moe/activation_test.py:117: 2025-05-07T20:33:21.7405778Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.7406122Z moe/activation_test.py:115: in fn 2025-05-07T20:33:21.7406406Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:21.7407096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:21.7407783Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:21.7408324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:21.7409005Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:21.7409666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:21.7410202Z kernel = self.compile( 2025-05-07T20:33:21.7410747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:21.7411406Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:21.7411807Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.7412049Z 2025-05-07T20:33:21.7412255Z self = 2025-05-07T20:33:21.7413334Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:21.7414706Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cef8c4a0>} 2025-05-07T20:33:21.7416050Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:21.7417081Z context = 2025-05-07T20:33:21.7417375Z 2025-05-07T20:33:21.7417542Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:21.7418069Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:21.7418532Z module_map=module_map) 2025-05-07T20:33:21.7418900Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:21.7419269Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:21.7419575Z E ^ 2025-05-07T20:33:21.7420041Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:21.7420495Z 2025-05-07T20:33:21.7420956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:21.7421502Z 2025-05-07T20:33:21.7421616Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:21.7422032Z self=, 2025-05-07T20:33:21.7422474Z T=2048, 2025-05-07T20:33:21.7422666Z D=7168, 2025-05-07T20:33:21.7422864Z scale_ub=1200.0, 2025-05-07T20:33:21.7423089Z contiguous=True, 2025-05-07T20:33:21.7423316Z compiled=False, 2025-05-07T20:33:21.7423521Z ) 2025-05-07T20:33:21.8233600Z self = 2025-05-07T20:33:21.8234349Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:21.8234735Z 2025-05-07T20:33:21.8234844Z @given( 2025-05-07T20:33:21.8235160Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:21.8235507Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:21.8235883Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:21.8236235Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:21.8236695Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:21.8236986Z ) 2025-05-07T20:33:21.8237347Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:21.8237803Z def test_silu_mul_quant( 2025-05-07T20:33:21.8238047Z self, 2025-05-07T20:33:21.8238248Z T: int, 2025-05-07T20:33:21.8238452Z D: int, 2025-05-07T20:33:21.8238676Z scale_ub: Optional[float], 2025-05-07T20:33:21.8238953Z contiguous: bool, 2025-05-07T20:33:21.8239209Z compiled: bool, 2025-05-07T20:33:21.8239438Z ) -> None: 2025-05-07T20:33:21.8239657Z torch.manual_seed(2025) 2025-05-07T20:33:21.8239906Z 2025-05-07T20:33:21.8240183Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:21.8242252Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at moe/activation_test.py:117 (> y_fp8, y_scale = fn(), via silu_mul_quant at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:94 (> x_sign = torch.sign(x))

Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:92: OutOfMemoryError
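The out-of-memory failures above look cumulative rather than intrinsic to the individual examples: from T=2048 onward each example dies on its first CUDA allocation of a few tens or hundreds of MiB while roughly 22 GiB is already resident on the device, most plausibly tensors kept alive across examples (Hypothesis retains failing tracebacks while shrinking, and PyTorch's caching allocator holds on to freed blocks). A minimal cleanup hook along the following lines could be called at the top of each example; the release_cuda_memory helper is a sketch, not part of activation_test.py:

import gc

import torch

def release_cuda_memory() -> None:
    # Drop Python references to dead tensors, then return the CUDA
    # caching allocator's unused blocks to the driver.
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

Calling this at the start of test_silu_mul_quant (before torch.manual_seed(2025)) would keep one falsifying attempt's allocations from starving the next; it cannot help if live references are genuinely held elsewhere.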
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)

> y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.1565623Z
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.1566164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.1566838Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.1567570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.1568103Z kernel = self.compile( 2025-05-07T20:33:22.1568636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.1569292Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.1569734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.1569961Z 2025-05-07T20:33:22.1570170Z self = 2025-05-07T20:33:22.1571244Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.1572626Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32ced287c0>} 2025-05-07T20:33:22.1573970Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.1574996Z context = 2025-05-07T20:33:22.1575282Z 2025-05-07T20:33:22.1575449Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.1575963Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.1576430Z module_map=module_map) 2025-05-07T20:33:22.1576792Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.1577138Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.1577400Z E ^ 2025-05-07T20:33:22.1577866Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (26.44 MiB free; 21.74 GiB allocated by PyTorch, 10.99 MiB reserved but unallocated) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at moe/activation_test.py:117 (> y_fp8, y_scale = fn(), via torch/_dynamo/eval_frame.py:678: in _fn, then silu_mul_quant at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
> x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError
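The allocator hint repeated in each of these messages, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, is read when the CUDA caching allocator first initializes, so it has to be in the process environment before the first tensor lands on the GPU; setting it inside an already-running test is too late. A Python-side sketch of that (the workflow would more likely export it in the job environment; this is an assumption, not something this job currently sets):

import os

# Must happen before the first CUDA allocation in the process.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

x = torch.zeros(1, device="cuda")  # allocator now uses expandable segments

Expandable segments reduce fragmentation-driven OOMs; they cannot recover the roughly 21.8 GiB that is still genuinely allocated in this run.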
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (4.44 MiB free of 22.07 GiB) at moe/activation_test.py:95 (> x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0))

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:92: OutOfMemoryError
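Every non-OOM failure in this run, including the fourth distinct failure in the summary below, is the same Triton error: the kernels request the fp8e4nv dtype (FP8 E4M3, torch.float8_e4m3fn), which Triton only compiles for NVIDIA GPUs of compute capability 8.9 or newer; the dtypes offered instead, 'fp8e4b15' and 'fp8e5', are the pre-sm_89 fallbacks. Judging from the 22.07 GiB capacity in the OOM messages, this runner's device is likely an A10G-class sm_86 part, hence the rejection. A guard along these lines (the helper name is hypothetical, not from the test file) would skip rather than fail on such hardware:

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv lowers to native FP8 only on sm_89 (Ada) and newer.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
class Fp8KernelTests(unittest.TestCase):
    def test_silu_mul_quant(self) -> None:
        ...  # FP8 kernel launches only run where the dtype exists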
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
  |     if method() is not None:
  |        ^^^^^^^^
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |     ^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.4537663Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:22.4538273Z | self=, 2025-05-07T20:33:22.4538832Z | T=2048, 2025-05-07T20:33:22.4539143Z | D=5120, # or any other generated value 2025-05-07T20:33:22.4539603Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:22.4540101Z | contiguous=True, # or any other generated value 2025-05-07T20:33:22.4540594Z | compiled=False, # or any other generated value 2025-05-07T20:33:22.4541021Z | ) 2025-05-07T20:33:22.4541271Z | 2025-05-07T20:33:22.4541987Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:22.4542803Z +---------------- 2 ---------------- 2025-05-07T20:33:22.4543203Z | Traceback (most recent call last): 2025-05-07T20:33:22.4544315Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:22.4545416Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.4545933Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:22.4548645Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.4551321Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:22.4551968Z | self=, 2025-05-07T20:33:22.4552538Z | T=128, 2025-05-07T20:33:22.4552861Z | D=7168, 2025-05-07T20:33:22.4553148Z | scale_ub=None, 2025-05-07T20:33:22.4553467Z | contiguous=True, 2025-05-07T20:33:22.4553799Z | compiled=True, 2025-05-07T20:33:22.4554116Z | ) 2025-05-07T20:33:22.4554358Z | 2025-05-07T20:33:22.4555069Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:22.4556010Z +---------------- 3 ---------------- 2025-05-07T20:33:22.4556423Z | Traceback (most recent call last): 2025-05-07T20:33:22.4557377Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:22.4558442Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.4558959Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:22.4561569Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.4563550Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:22.4563984Z | self=, 2025-05-07T20:33:22.4564398Z | T=128, 2025-05-07T20:33:22.4564603Z | D=5120, 2025-05-07T20:33:22.4564809Z | scale_ub=1200.0, 2025-05-07T20:33:22.4565054Z | contiguous=True, 2025-05-07T20:33:22.4565297Z | compiled=True, 2025-05-07T20:33:22.4565702Z | ) 2025-05-07T20:33:22.4565890Z | 2025-05-07T20:33:22.4566412Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:22.4567019Z +---------------- 4 ---------------- 2025-05-07T20:33:22.4567309Z | Traceback (most recent call last): 2025-05-07T20:33:22.4568018Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:22.4568734Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:22.4569021Z | ^^^^^^^^ 2025-05-07T20:33:22.4569762Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:22.4570507Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.4570845Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:22.4571694Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:22.4572486Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:22.4573095Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:22.4573828Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.4574270Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:22.4574910Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:22.4575739Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:22.4576215Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:22.4576850Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:22.4577544Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:22.4577918Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:22.4578507Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:22.4579078Z | fn() 2025-05-07T20:33:22.4579672Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:22.4580330Z | self.fn.run( 2025-05-07T20:33:22.4580852Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:22.4581434Z | kernel = self.compile( 2025-05-07T20:33:22.4581698Z | ^^^^^^^^^^^^^ 2025-05-07T20:33:22.4582282Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:22.4582984Z | 
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |                              ^^^^^^^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
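The first three sub-failures are plain allocator OOMs, not kernel bugs: each dies allocating the bfloat16 input after earlier tests have already pinned 21.77 GiB of the 22.07 GiB card, and the error text itself names the mitigation. A minimal sketch of applying it (the environment variable is standard PyTorch; the shape is the falsifying example from sub-failure 1):

import os

# Must be set before torch initializes CUDA, or the caching allocator
# will not pick it up, exactly as the OOM message above suggests.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

# The allocation that OOMed in sub-failure 1 (T=2048, D=5120): 40 MiB of bf16.
x = torch.randn([2048, 2 * 5120], device="cuda", dtype=torch.bfloat16)

Expandable segments only helps when the failure is fragmentation of reserved-but-unallocated memory; here only 3.87 MiB is reserved and unallocated, so freeing memory leaked by earlier tests in the same process is the more likely fix.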
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)

self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f3351c60>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
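Every sub-failure above also prints a Hypothesis replay blob. A hedged sketch of using one, with the blob and strategies copied from this log; @reproduce_failure must be stacked on the same @given signature under the same Hypothesis version (6.131.14 here) or the blob will not decode, and the body below replays only the failing allocation rather than the full test:

from typing import Optional

import torch
from hypothesis import Verbosity, given, reproduce_failure, settings
from hypothesis import strategies as st

@reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')  # blob printed for sub-failure 1
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
# _MAX_SAMPLES is private to the test module, so only deadline=None is kept here.
@settings(verbosity=Verbosity.verbose, deadline=None)
def test_silu_mul_quant_repro(
    T: int,
    D: int,
    scale_ub: Optional[float],
    contiguous: bool,
    compiled: bool,
) -> None:
    # The line that OOMed; the decorator pins T=2048, D=5120, etc.
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

The log says "temporarily" for a reason: the decorator should be removed once the example passes, or the test will only ever run that one input.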
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
(test source as in the first example above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
(test source as above)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: → moe/activation_test.py:124: in ref_fn → fp8_gemm.py:2370: in triton_quantize_fp8_row → _kernel_quantize_fp8_row[grid](
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.4736177Z 2025-05-07T20:33:22.4736719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.4736726Z 2025-05-07T20:33:22.4736864Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.4737148Z self=, 2025-05-07T20:33:22.4737261Z T=16384, 2025-05-07T20:33:22.4737363Z D=7168, 2025-05-07T20:33:22.4737473Z scale_ub=1200.0, 2025-05-07T20:33:22.4737594Z contiguous=False, 2025-05-07T20:33:22.4737708Z compiled=False, 2025-05-07T20:33:22.4737814Z ) 2025-05-07T20:33:22.4738116Z self = 2025-05-07T20:33:22.4738356Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:22.4738364Z 2025-05-07T20:33:22.4738475Z @given( 2025-05-07T20:33:22.4738624Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.4738748Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.4738904Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.4739062Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.4739216Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.4739324Z ) 2025-05-07T20:33:22.4739650Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.4739777Z def test_silu_mul_quant( 2025-05-07T20:33:22.4739881Z self, 2025-05-07T20:33:22.4739996Z T: int, 2025-05-07T20:33:22.4740215Z D: int, 2025-05-07T20:33:22.4740357Z scale_ub: Optional[float], 2025-05-07T20:33:22.4740489Z contiguous: bool, 2025-05-07T20:33:22.4740619Z compiled: bool, 2025-05-07T20:33:22.4740780Z ) -> None: 2025-05-07T20:33:22.4740912Z torch.manual_seed(2025) 2025-05-07T20:33:22.4741025Z 2025-05-07T20:33:22.4741255Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.4741365Z 2025-05-07T20:33:22.4741501Z x_sign = torch.sign(x) 2025-05-07T20:33:22.4741671Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.4741796Z x = x_sign * x_clamp 2025-05-07T20:33:22.4741918Z x0 = x[:, :D] 2025-05-07T20:33:22.4742032Z x1 = x[:, D:] 2025-05-07T20:33:22.4742144Z 2025-05-07T20:33:22.4742261Z if contiguous: 2025-05-07T20:33:22.4742388Z x0 = x0.contiguous() 2025-05-07T20:33:22.4742524Z x1 = x1.contiguous() 2025-05-07T20:33:22.4742628Z 2025-05-07T20:33:22.4742765Z if scale_ub is not None: 2025-05-07T20:33:22.4742973Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.4743162Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.4743282Z ) 2025-05-07T20:33:22.4743396Z else: 2025-05-07T20:33:22.4743528Z scale_ub_tensor = None 2025-05-07T20:33:22.4743632Z 2025-05-07T20:33:22.4743815Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.4743940Z op = silu_mul_quant 2025-05-07T20:33:22.4744066Z if compiled: 2025-05-07T20:33:22.4744202Z op = torch.compile(op) 2025-05-07T20:33:22.4744348Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4744459Z 2025-05-07T20:33:22.4744586Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.4744592Z 2025-05-07T20:33:22.4744724Z moe/activation_test.py:117: 2025-05-07T20:33:22.4744916Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4745062Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.4745205Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4745898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:22.4746037Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.4746538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.4746851Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.4747317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.4747455Z kernel = self.compile( 2025-05-07T20:33:22.4747969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.4748151Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.4748291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4748298Z 2025-05-07T20:33:22.4748501Z self = 2025-05-07T20:33:22.4749290Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.4749843Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f2058040>} 2025-05-07T20:33:22.4750598Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.4750898Z context = 2025-05-07T20:33:22.4750906Z 2025-05-07T20:33:22.4751072Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.4751387Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.4751495Z module_map=module_map) 2025-05-07T20:33:22.4751665Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.4751765Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.4751844Z E ^ 2025-05-07T20:33:22.4752208Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.4752213Z 2025-05-07T20:33:22.4752629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.4752633Z 2025-05-07T20:33:22.4752744Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.4753014Z self=, 2025-05-07T20:33:22.4753095Z T=1, 2025-05-07T20:33:22.4753187Z D=7168, 2025-05-07T20:33:22.4753277Z scale_ub=None, 2025-05-07T20:33:22.4753364Z contiguous=True, 2025-05-07T20:33:22.4753455Z compiled=True, 2025-05-07T20:33:22.4753533Z ) 2025-05-07T20:33:22.4753753Z self = 2025-05-07T20:33:22.4753920Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:22.4753925Z 2025-05-07T20:33:22.4754006Z @given( 2025-05-07T20:33:22.4754127Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.4754238Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.4754355Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.4754483Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.4754601Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.4754680Z ) 2025-05-07T20:33:22.4754936Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.4755035Z def test_silu_mul_quant( 2025-05-07T20:33:22.4755121Z self, 2025-05-07T20:33:22.4755208Z T: int, 2025-05-07T20:33:22.4755288Z D: int, 2025-05-07T20:33:22.4755390Z scale_ub: Optional[float], 2025-05-07T20:33:22.4755488Z contiguous: bool, 2025-05-07T20:33:22.4755576Z compiled: bool, 2025-05-07T20:33:22.4755658Z ) -> None: 2025-05-07T20:33:22.4755877Z torch.manual_seed(2025) 2025-05-07T20:33:22.4755957Z 2025-05-07T20:33:22.4756136Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.4756213Z 2025-05-07T20:33:22.4756306Z x_sign = torch.sign(x) 2025-05-07T20:33:22.4756439Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.4756530Z x = x_sign * x_clamp 2025-05-07T20:33:22.4756618Z x0 = x[:, :D] 2025-05-07T20:33:22.4756710Z x1 = x[:, D:] 2025-05-07T20:33:22.4756788Z 2025-05-07T20:33:22.4756874Z if contiguous: 2025-05-07T20:33:22.4756975Z x0 = x0.contiguous() 2025-05-07T20:33:22.4757067Z x1 = x1.contiguous() 2025-05-07T20:33:22.4757145Z 2025-05-07T20:33:22.4757243Z if scale_ub is not None: 2025-05-07T20:33:22.4757351Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.4757501Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.4757580Z ) 2025-05-07T20:33:22.4757658Z else: 2025-05-07T20:33:22.4757761Z scale_ub_tensor = None 2025-05-07T20:33:22.4757838Z 2025-05-07T20:33:22.4757970Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.4758070Z op = silu_mul_quant 2025-05-07T20:33:22.4758158Z if compiled: 2025-05-07T20:33:22.4758259Z op = torch.compile(op) 2025-05-07T20:33:22.4758501Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4758578Z 2025-05-07T20:33:22.4758675Z y_fp8, y_scale = fn() 2025-05-07T20:33:22.4758802Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:22.4758920Z 2025-05-07T20:33:22.4759063Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.4759168Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:22.4759269Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:22.4759398Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:22.4759560Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.4759644Z 2025-05-07T20:33:22.4759775Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:22.4759781Z 2025-05-07T20:33:22.4759880Z moe/activation_test.py:126: 2025-05-07T20:33:22.4760011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4760133Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:22.4760314Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.4760887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:22.4760993Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:22.4761356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.4761584Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.4761951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:22.4762215Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:22.4762595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:22.4762772Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:22.4763126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:22.4763209Z fn() 2025-05-07T20:33:22.4763609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:22.4763699Z self.fn.run( 2025-05-07T20:33:22.4764036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.4764135Z kernel = self.compile( 2025-05-07T20:33:22.4764516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.4764691Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.4764828Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4764837Z 2025-05-07T20:33:22.4765046Z self = 2025-05-07T20:33:22.4766173Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.4766685Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f2058ea0>} 2025-05-07T20:33:22.4767428Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.4767623Z context = 2025-05-07T20:33:22.4767629Z 2025-05-07T20:33:22.4767943Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.4768279Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.4768390Z module_map=module_map) 2025-05-07T20:33:22.4768624Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.4768734Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:22.4768812Z E ^ 2025-05-07T20:33:22.4769166Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.4769181Z 2025-05-07T20:33:22.4769592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.4769597Z 2025-05-07T20:33:22.4769705Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.4769934Z self=, 2025-05-07T20:33:22.4770014Z T=4096, 2025-05-07T20:33:22.4770103Z D=5120, 2025-05-07T20:33:22.4770194Z scale_ub=None, 2025-05-07T20:33:22.4770343Z contiguous=False, 2025-05-07T20:33:22.4770434Z compiled=False, 2025-05-07T20:33:22.4770523Z ) 2025-05-07T20:33:22.4770745Z self = 2025-05-07T20:33:22.4770927Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:22.4770931Z 2025-05-07T20:33:22.4771012Z @given( 2025-05-07T20:33:22.4771133Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.4771240Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.4771356Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.4771475Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.4771596Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.4771674Z ) 2025-05-07T20:33:22.4771930Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.4772030Z def test_silu_mul_quant( 2025-05-07T20:33:22.4772112Z self, 2025-05-07T20:33:22.4772199Z T: int, 2025-05-07T20:33:22.4772279Z D: int, 2025-05-07T20:33:22.4772384Z scale_ub: Optional[float], 2025-05-07T20:33:22.4772484Z contiguous: bool, 2025-05-07T20:33:22.4772573Z compiled: bool, 2025-05-07T20:33:22.4772655Z ) -> None: 2025-05-07T20:33:22.4772762Z torch.manual_seed(2025) 2025-05-07T20:33:22.4772836Z 2025-05-07T20:33:22.4773004Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.4773089Z 2025-05-07T20:33:22.4773181Z x_sign = torch.sign(x) 2025-05-07T20:33:22.4773307Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.4773406Z x = x_sign * x_clamp 2025-05-07T20:33:22.4773488Z x0 = x[:, :D] 2025-05-07T20:33:22.4773578Z x1 = x[:, D:] 2025-05-07T20:33:22.4773654Z 2025-05-07T20:33:22.4773744Z if contiguous: 2025-05-07T20:33:22.4773847Z x0 = x0.contiguous() 2025-05-07T20:33:22.4773942Z x1 = x1.contiguous() 2025-05-07T20:33:22.4774018Z 2025-05-07T20:33:22.4774118Z if scale_ub is not None: 2025-05-07T20:33:22.4774228Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.4774364Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.4774447Z ) 2025-05-07T20:33:22.4774528Z else: 2025-05-07T20:33:22.4774625Z scale_ub_tensor = None 2025-05-07T20:33:22.4774707Z 2025-05-07T20:33:22.4774837Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.4774937Z op = silu_mul_quant 2025-05-07T20:33:22.4775024Z if compiled: 2025-05-07T20:33:22.4775125Z op = torch.compile(op) 2025-05-07T20:33:22.4775243Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4775319Z 2025-05-07T20:33:22.4775413Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.4775503Z 2025-05-07T20:33:22.4775608Z moe/activation_test.py:117: 2025-05-07T20:33:22.4775744Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4775887Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.4776000Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4776499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.4776603Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.4776960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.4777184Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.4777533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.4777628Z kernel = self.compile( 2025-05-07T20:33:22.4778053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.4778238Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.4778370Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4778374Z 2025-05-07T20:33:22.4778585Z self = 2025-05-07T20:33:22.4779361Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.4779918Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f317b240>} 2025-05-07T20:33:22.4780664Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.4780857Z context = 2025-05-07T20:33:22.4780864Z 2025-05-07T20:33:22.4781034Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.4781299Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.4781413Z module_map=module_map) 2025-05-07T20:33:22.4781574Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.4781676Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.4781763Z E ^ 2025-05-07T20:33:22.4782117Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.4782122Z 2025-05-07T20:33:22.4782534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.4782548Z 2025-05-07T20:33:22.4782652Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.4782877Z self=, 2025-05-07T20:33:22.4782963Z T=4096, 2025-05-07T20:33:22.4783042Z D=7168, 2025-05-07T20:33:22.4783127Z scale_ub=None, 2025-05-07T20:33:22.4783223Z contiguous=False, 2025-05-07T20:33:22.4783307Z compiled=False, 2025-05-07T20:33:22.4783382Z ) 2025-05-07T20:33:22.4783608Z self = 2025-05-07T20:33:22.4783781Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:22.4783786Z 2025-05-07T20:33:22.4783873Z @given( 2025-05-07T20:33:22.4783993Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.4784092Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.4784297Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.4784423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.4784539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.4784661Z ) 2025-05-07T20:33:22.4784905Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.4785000Z def test_silu_mul_quant( 2025-05-07T20:33:22.4785086Z self, 2025-05-07T20:33:22.4785165Z T: int, 2025-05-07T20:33:22.4785244Z D: int, 2025-05-07T20:33:22.4785349Z scale_ub: Optional[float], 2025-05-07T20:33:22.4785441Z contiguous: bool, 2025-05-07T20:33:22.4785533Z compiled: bool, 2025-05-07T20:33:22.4785613Z ) -> None: 2025-05-07T20:33:22.4785708Z torch.manual_seed(2025) 2025-05-07T20:33:22.4785788Z 2025-05-07T20:33:22.4785957Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.4786035Z 2025-05-07T20:33:22.4786140Z x_sign = torch.sign(x) 2025-05-07T20:33:22.4786305Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.4786397Z x = x_sign * x_clamp 2025-05-07T20:33:22.4786485Z x0 = x[:, :D] 2025-05-07T20:33:22.4786571Z x1 = x[:, D:] 2025-05-07T20:33:22.4786646Z 2025-05-07T20:33:22.4786738Z if contiguous: 2025-05-07T20:33:22.4786834Z x0 = x0.contiguous() 2025-05-07T20:33:22.4786928Z x1 = x1.contiguous() 2025-05-07T20:33:22.4787010Z 2025-05-07T20:33:22.4787101Z if scale_ub is not None: 2025-05-07T20:33:22.4787212Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.4787345Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.4787423Z ) 2025-05-07T20:33:22.4787505Z else: 2025-05-07T20:33:22.4787602Z scale_ub_tensor = None 2025-05-07T20:33:22.4795015Z 2025-05-07T20:33:22.4795173Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.4795289Z op = silu_mul_quant 2025-05-07T20:33:22.4795381Z if compiled: 2025-05-07T20:33:22.4795486Z op = torch.compile(op) 2025-05-07T20:33:22.4795602Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4795681Z 2025-05-07T20:33:22.4795879Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.4795885Z 2025-05-07T20:33:22.4795993Z moe/activation_test.py:117: 2025-05-07T20:33:22.4796124Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4796231Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.4796341Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4796844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.4796948Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.4797309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.4797540Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.4797885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.4797983Z kernel = self.compile( 2025-05-07T20:33:22.4798365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.4798549Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.4798677Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4798682Z 2025-05-07T20:33:22.4798893Z self = 2025-05-07T20:33:22.4799786Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.4800343Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f16f25c0>} 2025-05-07T20:33:22.4801129Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.4801323Z context = 2025-05-07T20:33:22.4801328Z 2025-05-07T20:33:22.4801502Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.4801771Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.4801893Z module_map=module_map) 2025-05-07T20:33:22.4802058Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.4802167Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.4802294Z E ^ 2025-05-07T20:33:22.4802654Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.4802662Z 2025-05-07T20:33:22.4803078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.4803089Z 2025-05-07T20:33:22.4803197Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.4803424Z self=, 2025-05-07T20:33:22.4803512Z T=128, 2025-05-07T20:33:22.4803593Z D=7168, 2025-05-07T20:33:22.4803682Z scale_ub=None, 2025-05-07T20:33:22.4803783Z contiguous=False, 2025-05-07T20:33:22.4803872Z compiled=True, 2025-05-07T20:33:22.4803953Z ) 2025-05-07T20:33:22.4804183Z self = 2025-05-07T20:33:22.4804369Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:22.4804375Z 2025-05-07T20:33:22.4804469Z @given( 2025-05-07T20:33:22.4804593Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.4804699Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.4804831Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.4804956Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.4805078Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.4805166Z ) 2025-05-07T20:33:22.4806352Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.4806452Z def test_silu_mul_quant( 2025-05-07T20:33:22.4806545Z self, 2025-05-07T20:33:22.4806629Z T: int, 2025-05-07T20:33:22.4806719Z D: int, 2025-05-07T20:33:22.4806823Z scale_ub: Optional[float], 2025-05-07T20:33:22.4806916Z contiguous: bool, 2025-05-07T20:33:22.4807018Z compiled: bool, 2025-05-07T20:33:22.4807103Z ) -> None: 2025-05-07T20:33:22.4807207Z torch.manual_seed(2025) 2025-05-07T20:33:22.4807296Z 2025-05-07T20:33:22.4807470Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.4807553Z 2025-05-07T20:33:22.4807659Z x_sign = torch.sign(x) 2025-05-07T20:33:22.4807789Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.4807883Z x = x_sign * x_clamp 2025-05-07T20:33:22.4807976Z x0 = x[:, :D] 2025-05-07T20:33:22.4808063Z x1 = x[:, D:] 2025-05-07T20:33:22.4808143Z 2025-05-07T20:33:22.4808239Z if contiguous: 2025-05-07T20:33:22.4808336Z x0 = x0.contiguous() 2025-05-07T20:33:22.4808436Z x1 = x1.contiguous() 2025-05-07T20:33:22.4808519Z 2025-05-07T20:33:22.4808617Z if scale_ub is not None: 2025-05-07T20:33:22.4808734Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.4808971Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.4809055Z ) 2025-05-07T20:33:22.4809146Z else: 2025-05-07T20:33:22.4809247Z scale_ub_tensor = None 2025-05-07T20:33:22.4809364Z 2025-05-07T20:33:22.4809510Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.4809609Z op = silu_mul_quant 2025-05-07T20:33:22.4809703Z if compiled: 2025-05-07T20:33:22.4809812Z op = torch.compile(op) 2025-05-07T20:33:22.4809925Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4810010Z 2025-05-07T20:33:22.4810106Z y_fp8, y_scale = fn() 2025-05-07T20:33:22.4810229Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:22.4810312Z 2025-05-07T20:33:22.4810453Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.4810559Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:22.4810672Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:22.4810806Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:22.4810990Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.4811081Z 2025-05-07T20:33:22.4811190Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:22.4811195Z 2025-05-07T20:33:22.4811304Z moe/activation_test.py:126: 2025-05-07T20:33:22.4811439Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4811550Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:22.4811696Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.4812256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:22.4812361Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:22.4812732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.4812960Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.4813339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:22.4813604Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:22.4813981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:22.4814158Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:22.4814502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:22.4814594Z fn() 2025-05-07T20:33:22.4814995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:22.4815082Z self.fn.run( 2025-05-07T20:33:22.4815437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.4815537Z kernel = self.compile( 2025-05-07T20:33:22.4815921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.4816108Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.4816240Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4816245Z 2025-05-07T20:33:22.4816463Z self = 2025-05-07T20:33:22.4817244Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.4817799Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f16f31a0>} 2025-05-07T20:33:22.4818592Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.4818830Z context = 2025-05-07T20:33:22.4818834Z 2025-05-07T20:33:22.4819011Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.4819280Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.4819395Z module_map=module_map) 2025-05-07T20:33:22.4819584Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.4819704Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:22.4819817Z E ^ 2025-05-07T20:33:22.4820221Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:22.4820229Z 
2025-05-07T20:33:22.4820649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:22.4820657Z 
[Hypothesis retries test_silu_mul_quant with fresh parameters; the test source and traceback are identical for every retry, so the next three examples are condensed here. Each ends in the same CompilationError from triton/compiler/compiler.py:100, rooted in ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"):
  - T=128,  D=7168, scale_ub=None,   contiguous=False, compiled=False -> fn() fails compiling _fbgemm_silu_mul_quant (gen_ai/moe/activation.py:80)
  - T=4096, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False -> fn() fails compiling _fbgemm_silu_mul_quant
  - T=1,    D=5120, scale_ub=None,   contiguous=True,  compiled=True  -> fn() returns, then ref_fn() fails compiling _kernel_quantize_fp8_row (triton_gemm/fp8_gemm.py:2370)]
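Why every example fails the same way: Triton's fp8e4nv is the e4m3fn float8 format, which NVIDIA GPUs implement natively only from SM 8.9 (Ada) and SM 9.0 (Hopper); the A10G on this linux.g5.4xlarge runner reports SM 8.6, so ast_to_ttir rejects the dtype before any kernel can run. A minimal sketch of the capability probe that predicts these failures (the helper name is hypothetical, not part of the test file):

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv corresponds to torch.float8_e4m3fn, which NVIDIA
        # hardware supports natively only from SM 8.9 (Ada) / SM 9.0 (Hopper).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    if torch.cuda.is_available():
        # On this runner's A10G (SM 8.6) this prints "(8, 6) False",
        # consistent with the CompilationError above.
        print(torch.cuda.get_device_capability(), supports_fp8e4nv())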
[Four more examples — T=2048, T=128, T=4096, and T=16384, each with D=5120, scale_ub=None, contiguous=True, compiled=True — repeat the same pattern: fn() returns, then ref_fn() fails in triton_quantize_fp8_row while compiling _kernel_quantize_fp8_row, with the identical CompilationError raised from triton/compiler/compiler.py:100.]
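For context on the ref_fn() path: triton_quantize_fp8_row performs per-row absmax quantization into fp8 with an optional upper bound on the row scale. Below is a pure-PyTorch sketch of that computation, under the assumption that dequantization is y_fp8.to(torch.float32) * y_scale[:, None] as in the test; the function name and clamping details are illustrative, the real kernel being _kernel_quantize_fp8_row in triton_gemm/fp8_gemm.py:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absmax scale, optionally capped by scale_ub, mapping each
        # row into the representable range of float8_e4m3fn (max 448.0).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale).clamp(-fp8_max, fp8_max)
        return y_fp8.to(torch.float8_e4m3fn), scale.squeeze(-1)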
[Three further examples fail identically:
  - T=1, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True  -> fn() fails under torch.compile (torch/_dynamo/eval_frame.py:678) while compiling _fbgemm_silu_mul_quant
  - T=1, D=5120, scale_ub=None,   contiguous=False, compiled=True  -> fn() returns, then ref_fn() fails compiling _kernel_quantize_fp8_row
  - T=1, D=5120, scale_ub=None,   contiguous=True,  compiled=False -> fn() fails compiling _fbgemm_silu_mul_quant
In every case the root error is ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.4982460Z 2025-05-07T20:33:22.4982871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.4982875Z 2025-05-07T20:33:22.4982978Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.4983217Z self=, 2025-05-07T20:33:22.4983303Z T=128, 2025-05-07T20:33:22.4983383Z D=5120, 2025-05-07T20:33:22.4983473Z scale_ub=None, 2025-05-07T20:33:22.4983566Z contiguous=False, 2025-05-07T20:33:22.4983651Z compiled=True, 2025-05-07T20:33:22.4983732Z ) 2025-05-07T20:33:22.4983953Z self = 2025-05-07T20:33:22.4984129Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:22.4984133Z 2025-05-07T20:33:22.4984213Z @given( 2025-05-07T20:33:22.4984333Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.4984442Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.4984559Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.4984677Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.4984798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.4984884Z ) 2025-05-07T20:33:22.4985133Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.4985234Z def test_silu_mul_quant( 2025-05-07T20:33:22.4985319Z self, 2025-05-07T20:33:22.4985405Z T: int, 2025-05-07T20:33:22.4985484Z D: int, 2025-05-07T20:33:22.4985586Z scale_ub: Optional[float], 2025-05-07T20:33:22.4985683Z contiguous: bool, 2025-05-07T20:33:22.4985771Z compiled: bool, 2025-05-07T20:33:22.4985852Z ) -> None: 2025-05-07T20:33:22.4985953Z torch.manual_seed(2025) 2025-05-07T20:33:22.4986031Z 2025-05-07T20:33:22.4986199Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.4986281Z 2025-05-07T20:33:22.4986376Z x_sign = torch.sign(x) 2025-05-07T20:33:22.4986503Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.4986600Z x = x_sign * x_clamp 2025-05-07T20:33:22.4986682Z x0 = x[:, :D] 2025-05-07T20:33:22.4986864Z x1 = x[:, D:] 2025-05-07T20:33:22.4986940Z 2025-05-07T20:33:22.4987030Z if contiguous: 2025-05-07T20:33:22.4987128Z x0 = x0.contiguous() 2025-05-07T20:33:22.4987263Z x1 = x1.contiguous() 2025-05-07T20:33:22.4987338Z 2025-05-07T20:33:22.4987436Z if scale_ub is not None: 2025-05-07T20:33:22.4987543Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.4987681Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.4987764Z ) 2025-05-07T20:33:22.4987843Z else: 2025-05-07T20:33:22.4987939Z scale_ub_tensor = None 2025-05-07T20:33:22.4988017Z 2025-05-07T20:33:22.4988146Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.4988244Z op = silu_mul_quant 2025-05-07T20:33:22.4988330Z if compiled: 2025-05-07T20:33:22.4988430Z op = torch.compile(op) 2025-05-07T20:33:22.4988542Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4988618Z 2025-05-07T20:33:22.4988749Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.4988754Z 2025-05-07T20:33:22.4988858Z moe/activation_test.py:117: 2025-05-07T20:33:22.4988992Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4989096Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.4989200Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4989566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.4989676Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.4990205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.4990302Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.4990663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.4990893Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.4991229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.4991330Z kernel = self.compile( 2025-05-07T20:33:22.4991709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.4991891Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.4992020Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4992024Z 2025-05-07T20:33:22.4992231Z self = 2025-05-07T20:33:22.4993016Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.4993523Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f01a3920>} 2025-05-07T20:33:22.4994276Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.4994468Z context = 2025-05-07T20:33:22.4994472Z 2025-05-07T20:33:22.4994638Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.4994906Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.4995014Z module_map=module_map) 2025-05-07T20:33:22.4995181Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.4995366Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.4995448Z E ^ 2025-05-07T20:33:22.4995915Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.4995995Z 2025-05-07T20:33:22.4996410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.4996415Z 2025-05-07T20:33:22.4996525Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.4996750Z self=, 2025-05-07T20:33:22.4996832Z T=128, 2025-05-07T20:33:22.4996921Z D=7168, 2025-05-07T20:33:22.4997007Z scale_ub=1200.0, 2025-05-07T20:33:22.4997097Z contiguous=False, 2025-05-07T20:33:22.4997189Z compiled=False, 2025-05-07T20:33:22.4997265Z ) 2025-05-07T20:33:22.4997485Z self = 2025-05-07T20:33:22.4997668Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:22.4997675Z 2025-05-07T20:33:22.4997800Z @given( 2025-05-07T20:33:22.4997925Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.4998034Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.4998151Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.4998274Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.4998390Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.4998468Z ) 2025-05-07T20:33:22.4998721Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.4998816Z def test_silu_mul_quant( 2025-05-07T20:33:22.4998898Z self, 2025-05-07T20:33:22.4998981Z T: int, 2025-05-07T20:33:22.4999062Z D: int, 2025-05-07T20:33:22.4999163Z scale_ub: Optional[float], 2025-05-07T20:33:22.4999260Z contiguous: bool, 2025-05-07T20:33:22.4999351Z compiled: bool, 2025-05-07T20:33:22.4999439Z ) -> None: 2025-05-07T20:33:22.4999538Z torch.manual_seed(2025) 2025-05-07T20:33:22.4999612Z 2025-05-07T20:33:22.4999792Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.4999889Z 2025-05-07T20:33:22.4999997Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5000140Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5000231Z x = x_sign * x_clamp 2025-05-07T20:33:22.5000312Z x0 = x[:, :D] 2025-05-07T20:33:22.5000399Z x1 = x[:, D:] 2025-05-07T20:33:22.5000473Z 2025-05-07T20:33:22.5000558Z if contiguous: 2025-05-07T20:33:22.5000657Z x0 = x0.contiguous() 2025-05-07T20:33:22.5000746Z x1 = x1.contiguous() 2025-05-07T20:33:22.5000818Z 2025-05-07T20:33:22.5000920Z if scale_ub is not None: 2025-05-07T20:33:22.5001027Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5001169Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5001247Z ) 2025-05-07T20:33:22.5001329Z else: 2025-05-07T20:33:22.5001429Z scale_ub_tensor = None 2025-05-07T20:33:22.5001506Z 2025-05-07T20:33:22.5001636Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5001732Z op = silu_mul_quant 2025-05-07T20:33:22.5001818Z if compiled: 2025-05-07T20:33:22.5001917Z op = torch.compile(op) 2025-05-07T20:33:22.5002028Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5002102Z 2025-05-07T20:33:22.5002193Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5002203Z 2025-05-07T20:33:22.5002300Z moe/activation_test.py:117: 2025-05-07T20:33:22.5002433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5002538Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5002644Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5003240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5003354Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5003876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5004166Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5004507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5004603Z kernel = self.compile( 2025-05-07T20:33:22.5004987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5005161Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5005289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5005298Z 2025-05-07T20:33:22.5005561Z self = 2025-05-07T20:33:22.5006339Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5006852Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f16f16c0>} 2025-05-07T20:33:22.5007596Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5007792Z context = 2025-05-07T20:33:22.5007796Z 2025-05-07T20:33:22.5007965Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5008232Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5008347Z module_map=module_map) 2025-05-07T20:33:22.5008510Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5008612Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5008695Z E ^ 2025-05-07T20:33:22.5009048Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5009053Z 2025-05-07T20:33:22.5009469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5009474Z 2025-05-07T20:33:22.5009579Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5009804Z self=, 2025-05-07T20:33:22.5009894Z T=128, 2025-05-07T20:33:22.5009982Z D=5120, 2025-05-07T20:33:22.5010069Z scale_ub=None, 2025-05-07T20:33:22.5010164Z contiguous=False, 2025-05-07T20:33:22.5010250Z compiled=False, 2025-05-07T20:33:22.5010334Z ) 2025-05-07T20:33:22.5010555Z self = 2025-05-07T20:33:22.5010727Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:22.5010732Z 2025-05-07T20:33:22.5010818Z @given( 2025-05-07T20:33:22.5010939Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5011040Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5011160Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5011277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5011391Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5011473Z ) 2025-05-07T20:33:22.5011714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5011899Z def test_silu_mul_quant( 2025-05-07T20:33:22.5011979Z self, 2025-05-07T20:33:22.5012061Z T: int, 2025-05-07T20:33:22.5012147Z D: int, 2025-05-07T20:33:22.5012285Z scale_ub: Optional[float], 2025-05-07T20:33:22.5012377Z contiguous: bool, 2025-05-07T20:33:22.5012469Z compiled: bool, 2025-05-07T20:33:22.5012551Z ) -> None: 2025-05-07T20:33:22.5012646Z torch.manual_seed(2025) 2025-05-07T20:33:22.5012726Z 2025-05-07T20:33:22.5012894Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5012970Z 2025-05-07T20:33:22.5013067Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5013193Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5013289Z x = x_sign * x_clamp 2025-05-07T20:33:22.5013373Z x0 = x[:, :D] 2025-05-07T20:33:22.5013456Z x1 = x[:, D:] 2025-05-07T20:33:22.5013536Z 2025-05-07T20:33:22.5013622Z if contiguous: 2025-05-07T20:33:22.5013722Z x0 = x0.contiguous() 2025-05-07T20:33:22.5013856Z x1 = x1.contiguous() 2025-05-07T20:33:22.5013936Z 2025-05-07T20:33:22.5014030Z if scale_ub is not None: 2025-05-07T20:33:22.5014146Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5014279Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5014355Z ) 2025-05-07T20:33:22.5014436Z else: 2025-05-07T20:33:22.5014531Z scale_ub_tensor = None 2025-05-07T20:33:22.5014605Z 2025-05-07T20:33:22.5014742Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5014831Z op = silu_mul_quant 2025-05-07T20:33:22.5014924Z if compiled: 2025-05-07T20:33:22.5015023Z op = torch.compile(op) 2025-05-07T20:33:22.5015129Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5015208Z 2025-05-07T20:33:22.5015299Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5015309Z 2025-05-07T20:33:22.5015406Z moe/activation_test.py:117: 2025-05-07T20:33:22.5015546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5015646Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5015750Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5016248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5016346Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5016708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5016929Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5017268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5017367Z kernel = self.compile( 2025-05-07T20:33:22.5017753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5017933Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5018063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5018068Z 2025-05-07T20:33:22.5018271Z self = 2025-05-07T20:33:22.5019052Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5019559Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cfd3fba0>} 2025-05-07T20:33:22.5020404Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5020634Z context = 2025-05-07T20:33:22.5020675Z 2025-05-07T20:33:22.5020838Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5021110Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5021222Z module_map=module_map) 2025-05-07T20:33:22.5021385Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5021486Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5021563Z E ^ 2025-05-07T20:33:22.5021922Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5021926Z 2025-05-07T20:33:22.5022339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5022386Z 2025-05-07T20:33:22.5022498Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5022721Z self=, 2025-05-07T20:33:22.5022805Z T=128, 2025-05-07T20:33:22.5022891Z D=5120, 2025-05-07T20:33:22.5022978Z scale_ub=1200.0, 2025-05-07T20:33:22.5023064Z contiguous=True, 2025-05-07T20:33:22.5023154Z compiled=False, 2025-05-07T20:33:22.5023231Z ) 2025-05-07T20:33:22.5023449Z self = 2025-05-07T20:33:22.5023624Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:22.5023629Z 2025-05-07T20:33:22.5023707Z @given( 2025-05-07T20:33:22.5023835Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5023935Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5024055Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5024184Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5024300Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5024381Z ) 2025-05-07T20:33:22.5024630Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5024726Z def test_silu_mul_quant( 2025-05-07T20:33:22.5024808Z self, 2025-05-07T20:33:22.5024892Z T: int, 2025-05-07T20:33:22.5024971Z D: int, 2025-05-07T20:33:22.5025074Z scale_ub: Optional[float], 2025-05-07T20:33:22.5025165Z contiguous: bool, 2025-05-07T20:33:22.5025253Z compiled: bool, 2025-05-07T20:33:22.5025337Z ) -> None: 2025-05-07T20:33:22.5025434Z torch.manual_seed(2025) 2025-05-07T20:33:22.5025507Z 2025-05-07T20:33:22.5025680Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5025754Z 2025-05-07T20:33:22.5025850Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5025984Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5026075Z x = x_sign * x_clamp 2025-05-07T20:33:22.5026158Z x0 = x[:, :D] 2025-05-07T20:33:22.5026251Z x1 = x[:, D:] 2025-05-07T20:33:22.5026326Z 2025-05-07T20:33:22.5026411Z if contiguous: 2025-05-07T20:33:22.5026510Z x0 = x0.contiguous() 2025-05-07T20:33:22.5026602Z x1 = x1.contiguous() 2025-05-07T20:33:22.5026683Z 2025-05-07T20:33:22.5026775Z if scale_ub is not None: 2025-05-07T20:33:22.5026883Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5027022Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5027101Z ) 2025-05-07T20:33:22.5027180Z else: 2025-05-07T20:33:22.5027282Z scale_ub_tensor = None 2025-05-07T20:33:22.5027356Z 2025-05-07T20:33:22.5027485Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5027689Z op = silu_mul_quant 2025-05-07T20:33:22.5027777Z if compiled: 2025-05-07T20:33:22.5027879Z op = torch.compile(op) 2025-05-07T20:33:22.5027990Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5028107Z 2025-05-07T20:33:22.5028201Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5028206Z 2025-05-07T20:33:22.5028303Z moe/activation_test.py:117: 2025-05-07T20:33:22.5028433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5028537Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5028636Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5029131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5029231Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5029587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5029883Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5030247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5030345Z kernel = self.compile( 2025-05-07T20:33:22.5030731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5030905Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5031031Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5031040Z 2025-05-07T20:33:22.5031242Z self = 2025-05-07T20:33:22.5032017Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5032527Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f078cb80>} 2025-05-07T20:33:22.5033271Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5033464Z context = 2025-05-07T20:33:22.5033468Z 2025-05-07T20:33:22.5033633Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5033893Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5034006Z module_map=module_map) 2025-05-07T20:33:22.5034167Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5034278Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5034356Z E ^ 2025-05-07T20:33:22.5034713Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5034720Z 2025-05-07T20:33:22.5035135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5035140Z 2025-05-07T20:33:22.5035243Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5035465Z self=, 2025-05-07T20:33:22.5035551Z T=1, 2025-05-07T20:33:22.5035630Z D=7168, 2025-05-07T20:33:22.5035790Z scale_ub=1200.0, 2025-05-07T20:33:22.5035878Z contiguous=True, 2025-05-07T20:33:22.5035962Z compiled=True, 2025-05-07T20:33:22.5036039Z ) 2025-05-07T20:33:22.5036259Z self = 2025-05-07T20:33:22.5036526Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:22.5036534Z 2025-05-07T20:33:22.5036617Z @given( 2025-05-07T20:33:22.5036736Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5036875Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5036993Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5037110Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5037228Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5037301Z ) 2025-05-07T20:33:22.5037545Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5037644Z def test_silu_mul_quant( 2025-05-07T20:33:22.5037722Z self, 2025-05-07T20:33:22.5037801Z T: int, 2025-05-07T20:33:22.5037885Z D: int, 2025-05-07T20:33:22.5037985Z scale_ub: Optional[float], 2025-05-07T20:33:22.5038076Z contiguous: bool, 2025-05-07T20:33:22.5038174Z compiled: bool, 2025-05-07T20:33:22.5038253Z ) -> None: 2025-05-07T20:33:22.5038388Z torch.manual_seed(2025) 2025-05-07T20:33:22.5038468Z 2025-05-07T20:33:22.5038636Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5038719Z 2025-05-07T20:33:22.5038812Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5038938Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5039030Z x = x_sign * x_clamp 2025-05-07T20:33:22.5039111Z x0 = x[:, :D] 2025-05-07T20:33:22.5039191Z x1 = x[:, D:] 2025-05-07T20:33:22.5039268Z 2025-05-07T20:33:22.5039354Z if contiguous: 2025-05-07T20:33:22.5039450Z x0 = x0.contiguous() 2025-05-07T20:33:22.5039546Z x1 = x1.contiguous() 2025-05-07T20:33:22.5039619Z 2025-05-07T20:33:22.5039712Z if scale_ub is not None: 2025-05-07T20:33:22.5039822Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5039958Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5040039Z ) 2025-05-07T20:33:22.5040129Z else: 2025-05-07T20:33:22.5040227Z scale_ub_tensor = None 2025-05-07T20:33:22.5040307Z 2025-05-07T20:33:22.5040436Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5040527Z op = silu_mul_quant 2025-05-07T20:33:22.5040617Z if compiled: 2025-05-07T20:33:22.5040716Z op = torch.compile(op) 2025-05-07T20:33:22.5040821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5040895Z 2025-05-07T20:33:22.5040985Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5040989Z 2025-05-07T20:33:22.5041086Z moe/activation_test.py:117: 2025-05-07T20:33:22.5041222Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5041325Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5041429Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5041801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5041894Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5042385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5042485Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5042842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5043067Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5043405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5043502Z kernel = self.compile( 2025-05-07T20:33:22.5043881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5044139Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5044277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5044320Z 2025-05-07T20:33:22.5044525Z self = 2025-05-07T20:33:22.5045305Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5045806Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f078e2a0>} 2025-05-07T20:33:22.5046552Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5046788Z context = 2025-05-07T20:33:22.5046792Z 2025-05-07T20:33:22.5046956Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5047223Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5047331Z module_map=module_map) 2025-05-07T20:33:22.5047494Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5047598Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5047676Z E ^ 2025-05-07T20:33:22.5048036Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5048040Z 2025-05-07T20:33:22.5048452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5048457Z 2025-05-07T20:33:22.5048564Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5048795Z self=, 2025-05-07T20:33:22.5048875Z T=1, 2025-05-07T20:33:22.5048957Z D=7168, 2025-05-07T20:33:22.5049043Z scale_ub=1200.0, 2025-05-07T20:33:22.5049130Z contiguous=False, 2025-05-07T20:33:22.5049221Z compiled=True, 2025-05-07T20:33:22.5049294Z ) 2025-05-07T20:33:22.5049511Z self = 2025-05-07T20:33:22.5049712Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:22.5049717Z 2025-05-07T20:33:22.5049814Z @given( 2025-05-07T20:33:22.5049935Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5050041Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5050156Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5050273Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5050396Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5050472Z ) 2025-05-07T20:33:22.5050721Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5050817Z def test_silu_mul_quant( 2025-05-07T20:33:22.5050893Z self, 2025-05-07T20:33:22.5050973Z T: int, 2025-05-07T20:33:22.5051050Z D: int, 2025-05-07T20:33:22.5051148Z scale_ub: Optional[float], 2025-05-07T20:33:22.5051240Z contiguous: bool, 2025-05-07T20:33:22.5051327Z compiled: bool, 2025-05-07T20:33:22.5051405Z ) -> None: 2025-05-07T20:33:22.5051504Z torch.manual_seed(2025) 2025-05-07T20:33:22.5051576Z 2025-05-07T20:33:22.5051742Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5051821Z 2025-05-07T20:33:22.5051912Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5052039Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5052178Z x = x_sign * x_clamp 2025-05-07T20:33:22.5052298Z x0 = x[:, :D] 2025-05-07T20:33:22.5052385Z x1 = x[:, D:] 2025-05-07T20:33:22.5052462Z 2025-05-07T20:33:22.5052552Z if contiguous: 2025-05-07T20:33:22.5052687Z x0 = x0.contiguous() 2025-05-07T20:33:22.5052777Z x1 = x1.contiguous() 2025-05-07T20:33:22.5056267Z 2025-05-07T20:33:22.5056378Z if scale_ub is not None: 2025-05-07T20:33:22.5056495Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5056636Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5056713Z ) 2025-05-07T20:33:22.5056799Z else: 2025-05-07T20:33:22.5056895Z scale_ub_tensor = None 2025-05-07T20:33:22.5056969Z 2025-05-07T20:33:22.5057106Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5057199Z op = silu_mul_quant 2025-05-07T20:33:22.5057285Z if compiled: 2025-05-07T20:33:22.5057399Z op = torch.compile(op) 2025-05-07T20:33:22.5057509Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5057649Z 2025-05-07T20:33:22.5057747Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5057756Z 2025-05-07T20:33:22.5057852Z moe/activation_test.py:117: 2025-05-07T20:33:22.5057985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5058086Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5058186Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5058562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5058655Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5059146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5059249Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5059609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5059866Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5060229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5060324Z kernel = self.compile( 2025-05-07T20:33:22.5060707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5060880Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5061012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5061017Z 2025-05-07T20:33:22.5061222Z self = 2025-05-07T20:33:22.5062002Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5062511Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f078f9c0>} 2025-05-07T20:33:22.5063257Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5063451Z context = 2025-05-07T20:33:22.5063455Z 2025-05-07T20:33:22.5063618Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5063880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5063989Z module_map=module_map) 2025-05-07T20:33:22.5064263Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5064375Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5064452Z E ^ 2025-05-07T20:33:22.5064808Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5064852Z 2025-05-07T20:33:22.5065267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5065272Z 2025-05-07T20:33:22.5065582Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5065893Z self=, 2025-05-07T20:33:22.5065976Z T=1, 2025-05-07T20:33:22.5066056Z D=7168, 2025-05-07T20:33:22.5066142Z scale_ub=None, 2025-05-07T20:33:22.5066230Z contiguous=False, 2025-05-07T20:33:22.5066314Z compiled=True, 2025-05-07T20:33:22.5066392Z ) 2025-05-07T20:33:22.5066613Z self = 2025-05-07T20:33:22.5066865Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:22.5066871Z 2025-05-07T20:33:22.5066957Z @given( 2025-05-07T20:33:22.5067083Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5067182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5067301Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5067418Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5067533Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5067606Z ) 2025-05-07T20:33:22.5067849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5067945Z def test_silu_mul_quant( 2025-05-07T20:33:22.5068021Z self, 2025-05-07T20:33:22.5068099Z T: int, 2025-05-07T20:33:22.5068182Z D: int, 2025-05-07T20:33:22.5068280Z scale_ub: Optional[float], 2025-05-07T20:33:22.5068373Z contiguous: bool, 2025-05-07T20:33:22.5068464Z compiled: bool, 2025-05-07T20:33:22.5068550Z ) -> None: 2025-05-07T20:33:22.5068643Z torch.manual_seed(2025) 2025-05-07T20:33:22.5068721Z 2025-05-07T20:33:22.5068890Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5068969Z 2025-05-07T20:33:22.5069059Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5069184Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5069279Z x = x_sign * x_clamp 2025-05-07T20:33:22.5069358Z x0 = x[:, :D] 2025-05-07T20:33:22.5069442Z x1 = x[:, D:] 2025-05-07T20:33:22.5069518Z 2025-05-07T20:33:22.5069602Z if contiguous: 2025-05-07T20:33:22.5069694Z x0 = x0.contiguous() 2025-05-07T20:33:22.5069788Z x1 = x1.contiguous() 2025-05-07T20:33:22.5069861Z 2025-05-07T20:33:22.5069951Z if scale_ub is not None: 2025-05-07T20:33:22.5070062Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5070199Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5070278Z ) 2025-05-07T20:33:22.5070354Z else: 2025-05-07T20:33:22.5070449Z scale_ub_tensor = None 2025-05-07T20:33:22.5070525Z 2025-05-07T20:33:22.5070653Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5070743Z op = silu_mul_quant 2025-05-07T20:33:22.5070834Z if compiled: 2025-05-07T20:33:22.5070934Z op = torch.compile(op) 2025-05-07T20:33:22.5071039Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5071117Z 2025-05-07T20:33:22.5071206Z y_fp8, y_scale = fn() 2025-05-07T20:33:22.5071326Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:22.5071408Z 2025-05-07T20:33:22.5071540Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5071645Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:22.5071869Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:22.5072000Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:22.5072141Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.5072274Z 2025-05-07T20:33:22.5072376Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:22.5072381Z 2025-05-07T20:33:22.5072482Z moe/activation_test.py:126: 2025-05-07T20:33:22.5072611Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5072718Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:22.5072854Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.5073410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:22.5073515Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:22.5073875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5074141Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5074513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:22.5074772Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:22.5075150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:22.5075320Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:22.5075660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:22.5075800Z fn() 2025-05-07T20:33:22.5076201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:22.5076290Z self.fn.run( 2025-05-07T20:33:22.5076635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5076728Z kernel = self.compile( 2025-05-07T20:33:22.5077114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5077288Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5077417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5077421Z 2025-05-07T20:33:22.5077628Z self = 2025-05-07T20:33:22.5078403Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5078913Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cff5cb80>} 2025-05-07T20:33:22.5079658Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5079850Z context = 2025-05-07T20:33:22.5079860Z 2025-05-07T20:33:22.5080022Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5080284Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5080396Z module_map=module_map) 2025-05-07T20:33:22.5080559Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5080662Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:22.5080746Z E ^ 2025-05-07T20:33:22.5081190Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5081195Z 2025-05-07T20:33:22.5081612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5081655Z 2025-05-07T20:33:22.5081758Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5081979Z self=, 2025-05-07T20:33:22.5082063Z T=1, 2025-05-07T20:33:22.5082140Z D=5120, 2025-05-07T20:33:22.5082223Z scale_ub=1200.0, 2025-05-07T20:33:22.5082312Z contiguous=False, 2025-05-07T20:33:22.5082395Z compiled=True, 2025-05-07T20:33:22.5082468Z ) 2025-05-07T20:33:22.5082687Z self = 2025-05-07T20:33:22.5082853Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:22.5082860Z 2025-05-07T20:33:22.5082944Z @given( 2025-05-07T20:33:22.5083101Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5083202Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5083323Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5083440Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5083553Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5083632Z ) 2025-05-07T20:33:22.5083873Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5083970Z def test_silu_mul_quant( 2025-05-07T20:33:22.5084046Z self, 2025-05-07T20:33:22.5084124Z T: int, 2025-05-07T20:33:22.5084205Z D: int, 2025-05-07T20:33:22.5084304Z scale_ub: Optional[float], 2025-05-07T20:33:22.5084391Z contiguous: bool, 2025-05-07T20:33:22.5084479Z compiled: bool, 2025-05-07T20:33:22.5084558Z ) -> None: 2025-05-07T20:33:22.5084659Z torch.manual_seed(2025) 2025-05-07T20:33:22.5084737Z 2025-05-07T20:33:22.5084912Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5084987Z 2025-05-07T20:33:22.5085085Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5085209Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5085296Z x = x_sign * x_clamp 2025-05-07T20:33:22.5085382Z x0 = x[:, :D] 2025-05-07T20:33:22.5085462Z x1 = x[:, D:] 2025-05-07T20:33:22.5085538Z 2025-05-07T20:33:22.5085622Z if contiguous: 2025-05-07T20:33:22.5085713Z x0 = x0.contiguous() 2025-05-07T20:33:22.5085803Z x1 = x1.contiguous() 2025-05-07T20:33:22.5085876Z 2025-05-07T20:33:22.5085967Z if scale_ub is not None: 2025-05-07T20:33:22.5086077Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5086211Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5086285Z ) 2025-05-07T20:33:22.5086371Z else: 2025-05-07T20:33:22.5086466Z scale_ub_tensor = None 2025-05-07T20:33:22.5086541Z 2025-05-07T20:33:22.5086674Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5086767Z op = silu_mul_quant 2025-05-07T20:33:22.5086855Z if compiled: 2025-05-07T20:33:22.5086953Z op = torch.compile(op) 2025-05-07T20:33:22.5087057Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5087132Z 2025-05-07T20:33:22.5087224Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5087229Z 2025-05-07T20:33:22.5087324Z moe/activation_test.py:117: 2025-05-07T20:33:22.5087459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5087558Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5087658Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5088074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5088203Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5088699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5088838Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5089192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5089416Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5089752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5089846Z kernel = self.compile( 2025-05-07T20:33:22.5090228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5090399Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5090534Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5090610Z 2025-05-07T20:33:22.5090815Z self = 2025-05-07T20:33:22.5091593Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5092101Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cff5de40>} 2025-05-07T20:33:22.5092844Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5093039Z context = 2025-05-07T20:33:22.5093046Z 2025-05-07T20:33:22.5093211Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5093474Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5093582Z module_map=module_map) 2025-05-07T20:33:22.5093742Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5093849Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5093926Z E ^ 2025-05-07T20:33:22.5094280Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5094285Z 2025-05-07T20:33:22.5094698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5094703Z 2025-05-07T20:33:22.5094803Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5095036Z self=, 2025-05-07T20:33:22.5095120Z T=1, 2025-05-07T20:33:22.5095199Z D=5120, 2025-05-07T20:33:22.5095284Z scale_ub=1200.0, 2025-05-07T20:33:22.5095372Z contiguous=False, 2025-05-07T20:33:22.5095458Z compiled=False, 2025-05-07T20:33:22.5095533Z ) 2025-05-07T20:33:22.5095754Z self = 2025-05-07T20:33:22.5095920Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:22.5095927Z 2025-05-07T20:33:22.5096008Z @given( 2025-05-07T20:33:22.5096126Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5096225Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5096344Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5096460Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5096577Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5096662Z ) 2025-05-07T20:33:22.5096994Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5097093Z def test_silu_mul_quant( 2025-05-07T20:33:22.5097173Z self, 2025-05-07T20:33:22.5097292Z T: int, 2025-05-07T20:33:22.5097370Z D: int, 2025-05-07T20:33:22.5097472Z scale_ub: Optional[float], 2025-05-07T20:33:22.5097561Z contiguous: bool, 2025-05-07T20:33:22.5097649Z compiled: bool, 2025-05-07T20:33:22.5097727Z ) -> None: 2025-05-07T20:33:22.5097821Z torch.manual_seed(2025) 2025-05-07T20:33:22.5097896Z 2025-05-07T20:33:22.5098063Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5098136Z 2025-05-07T20:33:22.5098234Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5098358Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5098446Z x = x_sign * x_clamp 2025-05-07T20:33:22.5098530Z x0 = x[:, :D] 2025-05-07T20:33:22.5098616Z x1 = x[:, D:] 2025-05-07T20:33:22.5098688Z 2025-05-07T20:33:22.5098818Z if contiguous: 2025-05-07T20:33:22.5098911Z x0 = x0.contiguous() 2025-05-07T20:33:22.5099002Z x1 = x1.contiguous() 2025-05-07T20:33:22.5099077Z 2025-05-07T20:33:22.5099167Z if scale_ub is not None: 2025-05-07T20:33:22.5099275Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5099409Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5099486Z ) 2025-05-07T20:33:22.5099574Z else: 2025-05-07T20:33:22.5099683Z scale_ub_tensor = None 2025-05-07T20:33:22.5099771Z 2025-05-07T20:33:22.5099918Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5100007Z op = silu_mul_quant 2025-05-07T20:33:22.5100093Z if compiled: 2025-05-07T20:33:22.5100195Z op = torch.compile(op) 2025-05-07T20:33:22.5100297Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5100378Z 2025-05-07T20:33:22.5100468Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5100477Z 2025-05-07T20:33:22.5100574Z moe/activation_test.py:117: 2025-05-07T20:33:22.5100704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5100810Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5100907Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5101404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5101500Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:22.5101855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:22.5102077Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:22.5102417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:22.5102516Z     kernel = self.compile(
2025-05-07T20:33:22.5102896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:22.5103070Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:22.5103201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:22.5103206Z
2025-05-07T20:33:22.5103408Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:33:22.5104187Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:22.5104736Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f32cff5eac0>}
2025-05-07T20:33:22.5105524Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:33:22.5105752Z context = <...>
2025-05-07T20:33:22.5105757Z
2025-05-07T20:33:22.5105919Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:22.5106181Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:22.5106289Z                            module_map=module_map)
2025-05-07T20:33:22.5106449Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:22.5106555Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:22.5106632Z E       ^
2025-05-07T20:33:22.5106987Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:22.5106996Z
2025-05-07T20:33:22.5107449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:22.5107456Z
2025-05-07T20:33:22.5107559Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:22.5107783Z     self=<...>,
2025-05-07T20:33:22.5107862Z     T=16384,
2025-05-07T20:33:22.5107941Z     D=5120,
2025-05-07T20:33:22.5108024Z     scale_ub=1200.0,
2025-05-07T20:33:22.5108113Z     contiguous=False,
2025-05-07T20:33:22.5108199Z     compiled=True,
2025-05-07T20:33:22.5108272Z )
2025-05-07T20:33:22.5108487Z self = <...>
2025-05-07T20:33:22.5108666Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:33:22.5108671Z
2025-05-07T20:33:22.5108748Z     @given(
2025-05-07T20:33:22.5108866Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:22.5108974Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:22.5109091Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:22.5109209Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:22.5109326Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:22.5109401Z     )
2025-05-07T20:33:22.5109649Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:22.5109746Z     def test_silu_mul_quant(
2025-05-07T20:33:22.5109841Z         self,
2025-05-07T20:33:22.5109929Z         T: int,
2025-05-07T20:33:22.5110023Z         D: int,
2025-05-07T20:33:22.5110122Z         scale_ub: Optional[float],
2025-05-07T20:33:22.5110213Z         contiguous: bool,
2025-05-07T20:33:22.5110298Z         compiled: bool,
2025-05-07T20:33:22.5110375Z     ) -> None:
2025-05-07T20:33:22.5110471Z         torch.manual_seed(2025)
2025-05-07T20:33:22.5110544Z
2025-05-07T20:33:22.5110714Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:22.5110794Z
2025-05-07T20:33:22.5110888Z         x_sign = torch.sign(x)
2025-05-07T20:33:22.5111015Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:22.5111105Z         x = x_sign * x_clamp
2025-05-07T20:33:22.5111185Z         x0 = x[:, :D]
2025-05-07T20:33:22.5111267Z         x1 = x[:, D:]
2025-05-07T20:33:22.5111338Z
2025-05-07T20:33:22.5111422Z         if contiguous:
2025-05-07T20:33:22.5111515Z             x0 = x0.contiguous()
2025-05-07T20:33:22.5111603Z             x1 = x1.contiguous()
2025-05-07T20:33:22.5111674Z
2025-05-07T20:33:22.5111768Z         if scale_ub is not None:
2025-05-07T20:33:22.5111873Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:22.5112004Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:22.5112084Z             )
2025-05-07T20:33:22.5112161Z         else:
2025-05-07T20:33:22.5112258Z             scale_ub_tensor = None
2025-05-07T20:33:22.5112381Z
2025-05-07T20:33:22.5112549Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:22.5112645Z             op = silu_mul_quant
2025-05-07T20:33:22.5112729Z             if compiled:
2025-05-07T20:33:22.5112869Z                 op = torch.compile(op)
2025-05-07T20:33:22.5112977Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:22.5113049Z
2025-05-07T20:33:22.5113141Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:22.5113145Z
2025-05-07T20:33:22.5113245Z moe/activation_test.py:117:
2025-05-07T20:33:22.5113375Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:22.5113478Z moe/activation_test.py:115: in fn
2025-05-07T20:33:22.5113579Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:22.5113942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:22.5114036Z     return fn(*args, **kwargs)
2025-05-07T20:33:22.5114564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:22.5114664Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:22.5115023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:22.5115246Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:22.5115584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:22.5115679Z     kernel = self.compile(
2025-05-07T20:33:22.5116157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:22.5116335Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:22.5116461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:22.5116465Z
2025-05-07T20:33:22.5116674Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:33:22.5117459Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:22.5117963Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f32cf910180>}
2025-05-07T20:33:22.5118708Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:33:22.5118898Z context = <...>
2025-05-07T20:33:22.5118903Z
2025-05-07T20:33:22.5119070Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:22.5119337Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:22.5119443Z                            module_map=module_map)
2025-05-07T20:33:22.5119607Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:22.5119705Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:22.5119781Z E       ^
2025-05-07T20:33:22.5120138Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:22.5120142Z
2025-05-07T20:33:22.5120551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:22.5120555Z
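What the failing test exercises: silu_mul_quant fuses a SiLU-gated multiply (SiLU(x0) * x1) with quantization to FP8, returning the quantized tensor and its scale. The sketch below is a minimal eager-mode reference of those semantics as inferred from the test body above; the function name silu_mul_quant_reference, the per-tensor amax scaling scheme, and the use of torch.float8_e4m3fn (the PyTorch dtype corresponding to Triton's fp8e4nv) are assumptions for illustration, not FBGEMM's actual implementation.

# Hedged reference sketch for the op under test (assumptions noted above).
from typing import Optional, Tuple

import torch

def silu_mul_quant_reference(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Compute SiLU(x0) * x1 in fp32 to avoid bf16 rounding in the reference.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Per-tensor absolute max, optionally bounded from above by scale_ub.
    amax = y.abs().amax()
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub.to(y.dtype))
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
    scale = fp8_max / amax.clamp(min=1e-12)
    y_fp8 = (y * scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    # Return the dequantization scale, matching the (y_fp8, y_scale) unpacking above.
    return y_fp8, scale.reciprocal()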
[11 further Hypothesis examples produced the identical traceback and CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); only the sampled parameters differed:]
2025-05-07T20:33:22.5120660Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:22.5133867Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:22.5146502Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:22.5159078Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:22.5172586Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:22.5188936Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:22.5201638Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:22.5214694Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:22.5227311Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:22.5240057Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:22.5253094Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5265113Z 2025-05-07T20:33:22.5265773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5265779Z 2025-05-07T20:33:22.5265887Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5266112Z self=, 2025-05-07T20:33:22.5266191Z T=2048, 2025-05-07T20:33:22.5266267Z D=7168, 2025-05-07T20:33:22.5266353Z scale_ub=None, 2025-05-07T20:33:22.5266438Z contiguous=False, 2025-05-07T20:33:22.5266521Z compiled=True, 2025-05-07T20:33:22.5266596Z ) 2025-05-07T20:33:22.5266813Z self = 2025-05-07T20:33:22.5266992Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:22.5267000Z 2025-05-07T20:33:22.5267080Z @given( 2025-05-07T20:33:22.5267197Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5267301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5267414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5267529Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5267644Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5267718Z ) 2025-05-07T20:33:22.5267960Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5268057Z def test_silu_mul_quant( 2025-05-07T20:33:22.5268134Z self, 2025-05-07T20:33:22.5268210Z T: int, 2025-05-07T20:33:22.5268289Z D: int, 2025-05-07T20:33:22.5268387Z scale_ub: Optional[float], 2025-05-07T20:33:22.5268476Z contiguous: bool, 2025-05-07T20:33:22.5268718Z compiled: bool, 2025-05-07T20:33:22.5268798Z ) -> None: 2025-05-07T20:33:22.5268897Z torch.manual_seed(2025) 2025-05-07T20:33:22.5268969Z 2025-05-07T20:33:22.5269140Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5269275Z 2025-05-07T20:33:22.5269366Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5269490Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5269580Z x = x_sign * x_clamp 2025-05-07T20:33:22.5269660Z x0 = x[:, :D] 2025-05-07T20:33:22.5269739Z x1 = x[:, D:] 2025-05-07T20:33:22.5269813Z 2025-05-07T20:33:22.5269895Z if contiguous: 2025-05-07T20:33:22.5269986Z x0 = x0.contiguous() 2025-05-07T20:33:22.5270076Z x1 = x1.contiguous() 2025-05-07T20:33:22.5270148Z 2025-05-07T20:33:22.5270243Z if scale_ub is not None: 2025-05-07T20:33:22.5270348Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5270483Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5270568Z ) 2025-05-07T20:33:22.5270698Z else: 2025-05-07T20:33:22.5270793Z scale_ub_tensor = None 2025-05-07T20:33:22.5270871Z 2025-05-07T20:33:22.5270999Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5271090Z op = silu_mul_quant 2025-05-07T20:33:22.5271179Z if compiled: 2025-05-07T20:33:22.5271276Z op = torch.compile(op) 2025-05-07T20:33:22.5271380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5271456Z 2025-05-07T20:33:22.5271546Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5271551Z 2025-05-07T20:33:22.5271649Z moe/activation_test.py:117: 2025-05-07T20:33:22.5271777Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5271877Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5271977Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5272351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5272445Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5272938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5273036Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5273394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5273615Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5273951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5274047Z kernel = self.compile( 2025-05-07T20:33:22.5274425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5274601Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5274734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5274741Z 2025-05-07T20:33:22.5274944Z self = 2025-05-07T20:33:22.5275775Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5276277Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf558720>} 2025-05-07T20:33:22.5277070Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5277325Z context = 2025-05-07T20:33:22.5277329Z 2025-05-07T20:33:22.5277494Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5277799Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5277906Z module_map=module_map) 2025-05-07T20:33:22.5278069Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5278170Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5278247Z E ^ 2025-05-07T20:33:22.5278602Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5278606Z 2025-05-07T20:33:22.5279014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5279018Z 2025-05-07T20:33:22.5279126Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5279388Z self=, 2025-05-07T20:33:22.5279468Z T=4096, 2025-05-07T20:33:22.5279552Z D=7168, 2025-05-07T20:33:22.5279636Z scale_ub=None, 2025-05-07T20:33:22.5279724Z contiguous=False, 2025-05-07T20:33:22.5279812Z compiled=True, 2025-05-07T20:33:22.5279886Z ) 2025-05-07T20:33:22.5280102Z self = 2025-05-07T20:33:22.5280278Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:22.5280282Z 2025-05-07T20:33:22.5280360Z @given( 2025-05-07T20:33:22.5280480Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5280582Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5280698Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5280818Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5280938Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5281014Z ) 2025-05-07T20:33:22.5281267Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5281362Z def test_silu_mul_quant( 2025-05-07T20:33:22.5281440Z self, 2025-05-07T20:33:22.5281521Z T: int, 2025-05-07T20:33:22.5281599Z D: int, 2025-05-07T20:33:22.5281697Z scale_ub: Optional[float], 2025-05-07T20:33:22.5281787Z contiguous: bool, 2025-05-07T20:33:22.5281871Z compiled: bool, 2025-05-07T20:33:22.5281947Z ) -> None: 2025-05-07T20:33:22.5282044Z torch.manual_seed(2025) 2025-05-07T20:33:22.5282116Z 2025-05-07T20:33:22.5282285Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5282359Z 2025-05-07T20:33:22.5282449Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5282575Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5282669Z x = x_sign * x_clamp 2025-05-07T20:33:22.5282748Z x0 = x[:, :D] 2025-05-07T20:33:22.5282835Z x1 = x[:, D:] 2025-05-07T20:33:22.5282907Z 2025-05-07T20:33:22.5282991Z if contiguous: 2025-05-07T20:33:22.5283088Z x0 = x0.contiguous() 2025-05-07T20:33:22.5283175Z x1 = x1.contiguous() 2025-05-07T20:33:22.5283246Z 2025-05-07T20:33:22.5283339Z if scale_ub is not None: 2025-05-07T20:33:22.5283442Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5283578Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5283653Z ) 2025-05-07T20:33:22.5283730Z else: 2025-05-07T20:33:22.5283828Z scale_ub_tensor = None 2025-05-07T20:33:22.5283900Z 2025-05-07T20:33:22.5284028Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5284122Z op = silu_mul_quant 2025-05-07T20:33:22.5284207Z if compiled: 2025-05-07T20:33:22.5284353Z op = torch.compile(op) 2025-05-07T20:33:22.5284504Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5284577Z 2025-05-07T20:33:22.5284669Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5284711Z 2025-05-07T20:33:22.5284810Z moe/activation_test.py:117: 2025-05-07T20:33:22.5284937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5285043Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5285141Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5285505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5285599Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5286087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5286183Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5286543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5286804Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5287147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5287240Z kernel = self.compile( 2025-05-07T20:33:22.5287620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5287795Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5287923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5287927Z 2025-05-07T20:33:22.5288134Z self = 2025-05-07T20:33:22.5288913Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5289415Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf559440>} 2025-05-07T20:33:22.5290161Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5290349Z context = 2025-05-07T20:33:22.5290354Z 2025-05-07T20:33:22.5290521Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5290783Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5290890Z module_map=module_map) 2025-05-07T20:33:22.5291057Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5291160Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5291239Z E ^ 2025-05-07T20:33:22.5291593Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5291599Z 2025-05-07T20:33:22.5292007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5292012Z 2025-05-07T20:33:22.5292118Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5292341Z self=, 2025-05-07T20:33:22.5292424Z T=16384, 2025-05-07T20:33:22.5292502Z D=5120, 2025-05-07T20:33:22.5292585Z scale_ub=1200.0, 2025-05-07T20:33:22.5292675Z contiguous=False, 2025-05-07T20:33:22.5292760Z compiled=False, 2025-05-07T20:33:22.5292834Z ) 2025-05-07T20:33:22.5293097Z self = 2025-05-07T20:33:22.5293317Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:22.5293321Z 2025-05-07T20:33:22.5293403Z @given( 2025-05-07T20:33:22.5293566Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5293666Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5293779Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5293899Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5294011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5294087Z ) 2025-05-07T20:33:22.5294329Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5294420Z def test_silu_mul_quant( 2025-05-07T20:33:22.5294498Z self, 2025-05-07T20:33:22.5294574Z T: int, 2025-05-07T20:33:22.5294650Z D: int, 2025-05-07T20:33:22.5294749Z scale_ub: Optional[float], 2025-05-07T20:33:22.5294843Z contiguous: bool, 2025-05-07T20:33:22.5294966Z compiled: bool, 2025-05-07T20:33:22.5295049Z ) -> None: 2025-05-07T20:33:22.5295141Z torch.manual_seed(2025) 2025-05-07T20:33:22.5295218Z 2025-05-07T20:33:22.5295393Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5295467Z 2025-05-07T20:33:22.5295560Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5295684Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5295772Z x = x_sign * x_clamp 2025-05-07T20:33:22.5295854Z x0 = x[:, :D] 2025-05-07T20:33:22.5295933Z x1 = x[:, D:] 2025-05-07T20:33:22.5296004Z 2025-05-07T20:33:22.5296090Z if contiguous: 2025-05-07T20:33:22.5296181Z x0 = x0.contiguous() 2025-05-07T20:33:22.5296272Z x1 = x1.contiguous() 2025-05-07T20:33:22.5296346Z 2025-05-07T20:33:22.5296438Z if scale_ub is not None: 2025-05-07T20:33:22.5296546Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5296685Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5296762Z ) 2025-05-07T20:33:22.5296840Z else: 2025-05-07T20:33:22.5296937Z scale_ub_tensor = None 2025-05-07T20:33:22.5297010Z 2025-05-07T20:33:22.5297140Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5297229Z op = silu_mul_quant 2025-05-07T20:33:22.5297314Z if compiled: 2025-05-07T20:33:22.5297415Z op = torch.compile(op) 2025-05-07T20:33:22.5297520Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5300738Z 2025-05-07T20:33:22.5300844Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5300849Z 2025-05-07T20:33:22.5300952Z moe/activation_test.py:117: 2025-05-07T20:33:22.5301081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5301182Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5301294Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5301796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:22.5301896Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5302259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5302479Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5302822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5302916Z kernel = self.compile( 2025-05-07T20:33:22.5303297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5303476Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5303668Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5303710Z 2025-05-07T20:33:22.5303921Z self = 2025-05-07T20:33:22.5304734Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5305237Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf55a340>} 2025-05-07T20:33:22.5305982Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5306171Z context = 2025-05-07T20:33:22.5306181Z 2025-05-07T20:33:22.5306387Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5306650Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5306761Z module_map=module_map) 2025-05-07T20:33:22.5306924Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5307023Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5307104Z E ^ 2025-05-07T20:33:22.5307456Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5307461Z 2025-05-07T20:33:22.5307872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5307876Z 2025-05-07T20:33:22.5307981Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5308205Z self=, 2025-05-07T20:33:22.5308290Z T=16384, 2025-05-07T20:33:22.5308369Z D=5120, 2025-05-07T20:33:22.5308451Z scale_ub=1200.0, 2025-05-07T20:33:22.5308543Z contiguous=True, 2025-05-07T20:33:22.5308631Z compiled=True, 2025-05-07T20:33:22.5308704Z ) 2025-05-07T20:33:22.5308924Z self = 2025-05-07T20:33:22.5309098Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:22.5309102Z 2025-05-07T20:33:22.5309180Z @given( 2025-05-07T20:33:22.5309304Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5309402Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5309515Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5309635Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5309748Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5309824Z ) 2025-05-07T20:33:22.5310074Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5310173Z def test_silu_mul_quant( 2025-05-07T20:33:22.5310253Z self, 2025-05-07T20:33:22.5310333Z T: int, 2025-05-07T20:33:22.5310413Z D: int, 2025-05-07T20:33:22.5310518Z scale_ub: Optional[float], 2025-05-07T20:33:22.5310606Z contiguous: bool, 2025-05-07T20:33:22.5310690Z compiled: bool, 2025-05-07T20:33:22.5310773Z ) -> None: 2025-05-07T20:33:22.5310869Z torch.manual_seed(2025) 2025-05-07T20:33:22.5310942Z 2025-05-07T20:33:22.5311112Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5311187Z 2025-05-07T20:33:22.5311284Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5311407Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5311495Z x = x_sign * x_clamp 2025-05-07T20:33:22.5311578Z x0 = x[:, :D] 2025-05-07T20:33:22.5311656Z x1 = x[:, D:] 2025-05-07T20:33:22.5311839Z 2025-05-07T20:33:22.5311928Z if contiguous: 2025-05-07T20:33:22.5312022Z x0 = x0.contiguous() 2025-05-07T20:33:22.5312109Z x1 = x1.contiguous() 2025-05-07T20:33:22.5312230Z 2025-05-07T20:33:22.5312321Z if scale_ub is not None: 2025-05-07T20:33:22.5312425Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5312560Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5312634Z ) 2025-05-07T20:33:22.5312717Z else: 2025-05-07T20:33:22.5312810Z scale_ub_tensor = None 2025-05-07T20:33:22.5312882Z 2025-05-07T20:33:22.5313013Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5313102Z op = silu_mul_quant 2025-05-07T20:33:22.5313186Z if compiled: 2025-05-07T20:33:22.5313288Z op = torch.compile(op) 2025-05-07T20:33:22.5313394Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5313473Z 2025-05-07T20:33:22.5313567Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5313611Z 2025-05-07T20:33:22.5313709Z moe/activation_test.py:117: 2025-05-07T20:33:22.5313839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5313943Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5314041Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5314409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5314501Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5314989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5315089Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5315444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5315673Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5316092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5316190Z kernel = self.compile( 2025-05-07T20:33:22.5316572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5316745Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5316871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5316876Z 2025-05-07T20:33:22.5317086Z self = 2025-05-07T20:33:22.5317861Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5318373Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf55b9c0>} 2025-05-07T20:33:22.5319118Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5319310Z context = 2025-05-07T20:33:22.5319315Z 2025-05-07T20:33:22.5319477Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5319739Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5319849Z module_map=module_map) 2025-05-07T20:33:22.5320008Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5320107Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5320835Z E ^ 2025-05-07T20:33:22.5321195Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5321239Z 2025-05-07T20:33:22.5321656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5321661Z 2025-05-07T20:33:22.5321763Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5321983Z self=, 2025-05-07T20:33:22.5322064Z T=16384, 2025-05-07T20:33:22.5322140Z D=5120, 2025-05-07T20:33:22.5322223Z scale_ub=None, 2025-05-07T20:33:22.5322313Z contiguous=False, 2025-05-07T20:33:22.5322394Z compiled=True, 2025-05-07T20:33:22.5322472Z ) 2025-05-07T20:33:22.5322688Z self = 2025-05-07T20:33:22.5322867Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:22.5322874Z 2025-05-07T20:33:22.5322960Z @given( 2025-05-07T20:33:22.5323118Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5323218Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5323338Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5323454Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5323571Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5323644Z ) 2025-05-07T20:33:22.5323887Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5323981Z def test_silu_mul_quant( 2025-05-07T20:33:22.5324058Z self, 2025-05-07T20:33:22.5324138Z T: int, 2025-05-07T20:33:22.5324219Z D: int, 2025-05-07T20:33:22.5324316Z scale_ub: Optional[float], 2025-05-07T20:33:22.5324405Z contiguous: bool, 2025-05-07T20:33:22.5324494Z compiled: bool, 2025-05-07T20:33:22.5324576Z ) -> None: 2025-05-07T20:33:22.5324670Z torch.manual_seed(2025) 2025-05-07T20:33:22.5324748Z 2025-05-07T20:33:22.5324917Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5324993Z 2025-05-07T20:33:22.5325088Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5325211Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5325305Z x = x_sign * x_clamp 2025-05-07T20:33:22.5325385Z x0 = x[:, :D] 2025-05-07T20:33:22.5325463Z x1 = x[:, D:] 2025-05-07T20:33:22.5325540Z 2025-05-07T20:33:22.5325622Z if contiguous: 2025-05-07T20:33:22.5325713Z x0 = x0.contiguous() 2025-05-07T20:33:22.5325805Z x1 = x1.contiguous() 2025-05-07T20:33:22.5325878Z 2025-05-07T20:33:22.5325970Z if scale_ub is not None: 2025-05-07T20:33:22.5326078Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5326209Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5326288Z ) 2025-05-07T20:33:22.5326365Z else: 2025-05-07T20:33:22.5326466Z scale_ub_tensor = None 2025-05-07T20:33:22.5326541Z 2025-05-07T20:33:22.5326670Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5326763Z op = silu_mul_quant 2025-05-07T20:33:22.5326849Z if compiled: 2025-05-07T20:33:22.5326947Z op = torch.compile(op) 2025-05-07T20:33:22.5327053Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5327130Z 2025-05-07T20:33:22.5327219Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5327224Z 2025-05-07T20:33:22.5327319Z moe/activation_test.py:117: 2025-05-07T20:33:22.5327450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5327549Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5327650Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5328062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5328193Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5328684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5328820Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5329174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5329402Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5329740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5329839Z kernel = self.compile( 2025-05-07T20:33:22.5330218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5330393Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5330563Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5330568Z 2025-05-07T20:33:22.5330773Z self = 2025-05-07T20:33:22.5331555Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5332059Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf368c20>} 2025-05-07T20:33:22.5332804Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5332999Z context = 2025-05-07T20:33:22.5333006Z 2025-05-07T20:33:22.5333172Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5333443Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5333549Z module_map=module_map) 2025-05-07T20:33:22.5333709Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5333811Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5333886Z E ^ 2025-05-07T20:33:22.5334240Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5334250Z 2025-05-07T20:33:22.5334658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5334662Z 2025-05-07T20:33:22.5334764Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5334995Z self=, 2025-05-07T20:33:22.5335075Z T=2048, 2025-05-07T20:33:22.5335152Z D=5120, 2025-05-07T20:33:22.5335238Z scale_ub=None, 2025-05-07T20:33:22.5335329Z contiguous=False, 2025-05-07T20:33:22.5335410Z compiled=True, 2025-05-07T20:33:22.5335484Z ) 2025-05-07T20:33:22.5335702Z self = 2025-05-07T20:33:22.5335876Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:22.5335881Z 2025-05-07T20:33:22.5335959Z @given( 2025-05-07T20:33:22.5336077Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5336179Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5336294Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5336410Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5336526Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5336683Z ) 2025-05-07T20:33:22.5336931Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5337024Z def test_silu_mul_quant( 2025-05-07T20:33:22.5337139Z self, 2025-05-07T20:33:22.5337225Z T: int, 2025-05-07T20:33:22.5337301Z D: int, 2025-05-07T20:33:22.5337398Z scale_ub: Optional[float], 2025-05-07T20:33:22.5337488Z contiguous: bool, 2025-05-07T20:33:22.5337573Z compiled: bool, 2025-05-07T20:33:22.5337651Z ) -> None: 2025-05-07T20:33:22.5337749Z torch.manual_seed(2025) 2025-05-07T20:33:22.5337822Z 2025-05-07T20:33:22.5337988Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5338066Z 2025-05-07T20:33:22.5338156Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5338279Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5338373Z x = x_sign * x_clamp 2025-05-07T20:33:22.5338458Z x0 = x[:, :D] 2025-05-07T20:33:22.5338541Z x1 = x[:, D:] 2025-05-07T20:33:22.5338681Z 2025-05-07T20:33:22.5338767Z if contiguous: 2025-05-07T20:33:22.5338861Z x0 = x0.contiguous() 2025-05-07T20:33:22.5338953Z x1 = x1.contiguous() 2025-05-07T20:33:22.5339025Z 2025-05-07T20:33:22.5339117Z if scale_ub is not None: 2025-05-07T20:33:22.5339226Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5339359Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5339435Z ) 2025-05-07T20:33:22.5339514Z else: 2025-05-07T20:33:22.5339609Z scale_ub_tensor = None 2025-05-07T20:33:22.5339681Z 2025-05-07T20:33:22.5339814Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5339905Z op = silu_mul_quant 2025-05-07T20:33:22.5339993Z if compiled: 2025-05-07T20:33:22.5340093Z op = torch.compile(op) 2025-05-07T20:33:22.5340202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5340280Z 2025-05-07T20:33:22.5340372Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5340376Z 2025-05-07T20:33:22.5340473Z moe/activation_test.py:117: 2025-05-07T20:33:22.5340608Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5340708Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5340805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5341172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5341268Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5341761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5341857Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5342215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5342443Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5342779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5342874Z kernel = self.compile( 2025-05-07T20:33:22.5343256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5343429Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5343558Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5343563Z 2025-05-07T20:33:22.5343764Z self = 2025-05-07T20:33:22.5344586Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5345132Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf3699e0>} 2025-05-07T20:33:22.5345915Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5346108Z context = 2025-05-07T20:33:22.5346112Z 2025-05-07T20:33:22.5346274Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5346543Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5346650Z module_map=module_map) 2025-05-07T20:33:22.5346809Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5346916Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5347031Z E ^ 2025-05-07T20:33:22.5347386Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5347394Z 2025-05-07T20:33:22.5347809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5347814Z 2025-05-07T20:33:22.5347915Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5348138Z self=, 2025-05-07T20:33:22.5348214Z T=2048, 2025-05-07T20:33:22.5348290Z D=5120, 2025-05-07T20:33:22.5348374Z scale_ub=1200.0, 2025-05-07T20:33:22.5348460Z contiguous=False, 2025-05-07T20:33:22.5348541Z compiled=True, 2025-05-07T20:33:22.5348616Z ) 2025-05-07T20:33:22.5348833Z self = 2025-05-07T20:33:22.5349012Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:22.5349021Z 2025-05-07T20:33:22.5349097Z @given( 2025-05-07T20:33:22.5349215Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5349318Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5349433Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5349549Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5349666Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5349740Z ) 2025-05-07T20:33:22.5349981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5350078Z def test_silu_mul_quant( 2025-05-07T20:33:22.5350154Z self, 2025-05-07T20:33:22.5350231Z T: int, 2025-05-07T20:33:22.5350309Z D: int, 2025-05-07T20:33:22.5350406Z scale_ub: Optional[float], 2025-05-07T20:33:22.5350497Z contiguous: bool, 2025-05-07T20:33:22.5350588Z compiled: bool, 2025-05-07T20:33:22.5350666Z ) -> None: 2025-05-07T20:33:22.5350765Z torch.manual_seed(2025) 2025-05-07T20:33:22.5350837Z 2025-05-07T20:33:22.5351005Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5351085Z 2025-05-07T20:33:22.5351177Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5351300Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5351391Z x = x_sign * x_clamp 2025-05-07T20:33:22.5351471Z x0 = x[:, :D] 2025-05-07T20:33:22.5351549Z x1 = x[:, D:] 2025-05-07T20:33:22.5351625Z 2025-05-07T20:33:22.5351707Z if contiguous: 2025-05-07T20:33:22.5351799Z x0 = x0.contiguous() 2025-05-07T20:33:22.5351887Z x1 = x1.contiguous() 2025-05-07T20:33:22.5351959Z 2025-05-07T20:33:22.5352053Z if scale_ub is not None: 2025-05-07T20:33:22.5352159Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5352377Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5352459Z ) 2025-05-07T20:33:22.5352536Z else: 2025-05-07T20:33:22.5352630Z scale_ub_tensor = None 2025-05-07T20:33:22.5352747Z 2025-05-07T20:33:22.5352875Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5352966Z op = silu_mul_quant 2025-05-07T20:33:22.5353056Z if compiled: 2025-05-07T20:33:22.5353154Z op = torch.compile(op) 2025-05-07T20:33:22.5353261Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5353334Z 2025-05-07T20:33:22.5353423Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5353427Z 2025-05-07T20:33:22.5353526Z moe/activation_test.py:117: 2025-05-07T20:33:22.5353654Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5353754Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5353858Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5354269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5354364Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5354859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5354955Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5355313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5355532Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5355917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5356014Z kernel = self.compile( 2025-05-07T20:33:22.5356391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5356573Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5356705Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5356712Z 2025-05-07T20:33:22.5356914Z self = 2025-05-07T20:33:22.5357691Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5358193Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf36ab60>} 2025-05-07T20:33:22.5358944Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5359136Z context = 2025-05-07T20:33:22.5359140Z 2025-05-07T20:33:22.5359303Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5359570Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5359675Z module_map=module_map) 2025-05-07T20:33:22.5359837Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5359936Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5360012Z E ^ 2025-05-07T20:33:22.5360369Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5360374Z 2025-05-07T20:33:22.5360784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5360788Z 2025-05-07T20:33:22.5360980Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5361206Z self=, 2025-05-07T20:33:22.5361285Z T=4096, 2025-05-07T20:33:22.5361404Z D=5120, 2025-05-07T20:33:22.5361487Z scale_ub=1200.0, 2025-05-07T20:33:22.5361571Z contiguous=True, 2025-05-07T20:33:22.5361656Z compiled=True, 2025-05-07T20:33:22.5361729Z ) 2025-05-07T20:33:22.5361947Z self = 2025-05-07T20:33:22.5362119Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:22.5362124Z 2025-05-07T20:33:22.5362202Z @given( 2025-05-07T20:33:22.5362324Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5362422Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5362537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5362657Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5362774Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5362892Z ) 2025-05-07T20:33:22.5363137Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5363235Z def test_silu_mul_quant( 2025-05-07T20:33:22.5363311Z self, 2025-05-07T20:33:22.5363393Z T: int, 2025-05-07T20:33:22.5363470Z D: int, 2025-05-07T20:33:22.5363566Z scale_ub: Optional[float], 2025-05-07T20:33:22.5363657Z contiguous: bool, 2025-05-07T20:33:22.5363742Z compiled: bool, 2025-05-07T20:33:22.5363820Z ) -> None: 2025-05-07T20:33:22.5363917Z torch.manual_seed(2025) 2025-05-07T20:33:22.5363991Z 2025-05-07T20:33:22.5364158Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5364234Z 2025-05-07T20:33:22.5364325Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5364451Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5364547Z x = x_sign * x_clamp 2025-05-07T20:33:22.5364626Z x0 = x[:, :D] 2025-05-07T20:33:22.5364712Z x1 = x[:, D:] 2025-05-07T20:33:22.5364785Z 2025-05-07T20:33:22.5364868Z if contiguous: 2025-05-07T20:33:22.5364963Z x0 = x0.contiguous() 2025-05-07T20:33:22.5365052Z x1 = x1.contiguous() 2025-05-07T20:33:22.5365124Z 2025-05-07T20:33:22.5365219Z if scale_ub is not None: 2025-05-07T20:33:22.5365325Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5365691Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5365810Z ) 2025-05-07T20:33:22.5365899Z else: 2025-05-07T20:33:22.5366000Z scale_ub_tensor = None 2025-05-07T20:33:22.5366073Z 2025-05-07T20:33:22.5366203Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5366295Z op = silu_mul_quant 2025-05-07T20:33:22.5366380Z if compiled: 2025-05-07T20:33:22.5366485Z op = torch.compile(op) 2025-05-07T20:33:22.5366594Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5366666Z 2025-05-07T20:33:22.5366757Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5366763Z 2025-05-07T20:33:22.5366863Z moe/activation_test.py:117: 2025-05-07T20:33:22.5366992Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5367099Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5367197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5367561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5367658Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5368148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5368246Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5368696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5368973Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5369406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5369502Z kernel = self.compile( 2025-05-07T20:33:22.5369881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5370057Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5370185Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5370189Z 2025-05-07T20:33:22.5370396Z self = 2025-05-07T20:33:22.5371232Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5371740Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf4b8180>} 2025-05-07T20:33:22.5372490Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5372678Z context = 2025-05-07T20:33:22.5372683Z 2025-05-07T20:33:22.5372849Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5373112Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5373218Z module_map=module_map) 2025-05-07T20:33:22.5373386Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5373488Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5373566Z E ^ 2025-05-07T20:33:22.5373922Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5373929Z 2025-05-07T20:33:22.5374338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5374342Z 2025-05-07T20:33:22.5374448Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5374670Z self=, 2025-05-07T20:33:22.5374750Z T=128, 2025-05-07T20:33:22.5374832Z D=5120, 2025-05-07T20:33:22.5374916Z scale_ub=1200.0, 2025-05-07T20:33:22.5375001Z contiguous=False, 2025-05-07T20:33:22.5375087Z compiled=True, 2025-05-07T20:33:22.5375160Z ) 2025-05-07T20:33:22.5375381Z self = 2025-05-07T20:33:22.5375559Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:22.5375563Z 2025-05-07T20:33:22.5375641Z @given( 2025-05-07T20:33:22.5375764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5375862Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5375974Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5376094Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5376207Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5376284Z ) 2025-05-07T20:33:22.5376524Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5376618Z def test_silu_mul_quant( 2025-05-07T20:33:22.5376700Z self, 2025-05-07T20:33:22.5376776Z T: int, 2025-05-07T20:33:22.5376852Z D: int, 2025-05-07T20:33:22.5376953Z scale_ub: Optional[float], 2025-05-07T20:33:22.5377126Z contiguous: bool, 2025-05-07T20:33:22.5377215Z compiled: bool, 2025-05-07T20:33:22.5377298Z ) -> None: 2025-05-07T20:33:22.5377393Z torch.manual_seed(2025) 2025-05-07T20:33:22.5377505Z 2025-05-07T20:33:22.5377679Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5377754Z 2025-05-07T20:33:22.5377846Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5377972Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5378061Z x = x_sign * x_clamp 2025-05-07T20:33:22.5378144Z x0 = x[:, :D] 2025-05-07T20:33:22.5378224Z x1 = x[:, D:] 2025-05-07T20:33:22.5378296Z 2025-05-07T20:33:22.5378382Z if contiguous: 2025-05-07T20:33:22.5378472Z x0 = x0.contiguous() 2025-05-07T20:33:22.5378561Z x1 = x1.contiguous() 2025-05-07T20:33:22.5378636Z 2025-05-07T20:33:22.5378726Z if scale_ub is not None: 2025-05-07T20:33:22.5378836Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5379017Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5379094Z ) 2025-05-07T20:33:22.5379170Z else: 2025-05-07T20:33:22.5379269Z scale_ub_tensor = None 2025-05-07T20:33:22.5379342Z 2025-05-07T20:33:22.5379475Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5379566Z op = silu_mul_quant 2025-05-07T20:33:22.5379664Z if compiled: 2025-05-07T20:33:22.5379780Z op = torch.compile(op) 2025-05-07T20:33:22.5379906Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5379983Z 2025-05-07T20:33:22.5380076Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5380081Z 2025-05-07T20:33:22.5380176Z moe/activation_test.py:117: 2025-05-07T20:33:22.5380305Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5380409Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5380514Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5380884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5380980Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=...,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
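Every fp8 example below fails the same way: Triton's fp8e4nv type corresponds to torch.float8_e4m3fn, which Triton supports only on GPUs of compute capability 8.9 or newer (Ada/Hopper class), while the A10G on this g5.4xlarge runner is SM 8.6, so only fp8e4b15 and fp8e5 are available. A minimal sketch of a capability guard that would skip these cases instead of failing them; the helper and decorator names are illustrative, not FBGEMM's actual test code:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+ (Ada/Hopper);
        # older parts expose only fp8e4b15 and fp8e5, per the error above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the test class or method, this would skip rather than fail
    # on pre-SM-8.9 GPUs such as this runner's A10G (SM 8.6).
    skip_if_no_fp8e4nv = unittest.skipUnless(
        _supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9"
    )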
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported ...")

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported ...")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported ...")

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported ...")
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. ...
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. ...
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. ...
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. ...
moe/activation_test.py:94: OutOfMemoryError
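The requested allocation sizes line up exactly with the first tensor the test creates at moe/activation_test.py:92, x = torch.randn([T, 2 * D], dtype=torch.bfloat16), at 2 bytes per bfloat16 element. A quick arithmetic check:

    # x = torch.randn([T, 2 * D], dtype=torch.bfloat16) takes T * 2D * 2 bytes.
    for T, D in [(16384, 5120), (4096, 7168), (16384, 7168), (2048, 7168)]:
        print(f"T={T:>5}, D={D}: {T * 2 * D * 2 / 2**20:.2f} MiB")
    # Prints 320.00, 112.00, 448.00 and 56.00 MiB, exactly the sizes the
    # OutOfMemoryError messages above report.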
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5496178Z 2025-05-07T20:33:22.5496331Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:22.5496337Z 2025-05-07T20:33:22.5496445Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5496666Z self=, 2025-05-07T20:33:22.5496744Z T=1, 2025-05-07T20:33:22.5496827Z D=7168, 2025-05-07T20:33:22.5496910Z scale_ub=1200.0, 2025-05-07T20:33:22.5496995Z contiguous=True, 2025-05-07T20:33:22.5497084Z compiled=False, 2025-05-07T20:33:22.5497158Z ) 2025-05-07T20:33:22.5497373Z self = 2025-05-07T20:33:22.5497541Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:22.5497546Z 2025-05-07T20:33:22.5497624Z @given( 2025-05-07T20:33:22.5497747Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5497849Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5497968Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5498088Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5498201Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5498274Z ) 2025-05-07T20:33:22.5498517Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5498609Z def test_silu_mul_quant( 2025-05-07T20:33:22.5498683Z self, 2025-05-07T20:33:22.5498763Z T: int, 2025-05-07T20:33:22.5498839Z D: int, 2025-05-07T20:33:22.5498938Z scale_ub: Optional[float], 2025-05-07T20:33:22.5499025Z contiguous: bool, 2025-05-07T20:33:22.5499110Z compiled: bool, 2025-05-07T20:33:22.5499189Z ) -> None: 2025-05-07T20:33:22.5499281Z torch.manual_seed(2025) 2025-05-07T20:33:22.5499354Z 2025-05-07T20:33:22.5499523Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5499602Z 2025-05-07T20:33:22.5499692Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5499821Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5499910Z x = x_sign * x_clamp 2025-05-07T20:33:22.5499992Z x0 = x[:, :D] 2025-05-07T20:33:22.5500076Z x1 = x[:, D:] 2025-05-07T20:33:22.5500150Z 2025-05-07T20:33:22.5500241Z if contiguous: 2025-05-07T20:33:22.5500334Z x0 = x0.contiguous() 2025-05-07T20:33:22.5500422Z x1 = x1.contiguous() 2025-05-07T20:33:22.5500498Z 2025-05-07T20:33:22.5500587Z if scale_ub is not None: 2025-05-07T20:33:22.5500692Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5500830Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5500906Z ) 2025-05-07T20:33:22.5500982Z else: 2025-05-07T20:33:22.5501079Z scale_ub_tensor = None 2025-05-07T20:33:22.5501151Z 2025-05-07T20:33:22.5501363Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5501460Z op = silu_mul_quant 2025-05-07T20:33:22.5501545Z if compiled: 2025-05-07T20:33:22.5501642Z op = torch.compile(op) 2025-05-07T20:33:22.5501790Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5501861Z 2025-05-07T20:33:22.5501955Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5501960Z 2025-05-07T20:33:22.5502055Z moe/activation_test.py:117: 2025-05-07T20:33:22.5502183Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5502288Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5502388Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5502886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5502986Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5503346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5503610Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5503952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5504044Z kernel = self.compile( 2025-05-07T20:33:22.5504427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5504598Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5504728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5504733Z 2025-05-07T20:33:22.5504936Z self = 2025-05-07T20:33:22.5505721Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5506229Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf0aa520>} 2025-05-07T20:33:22.5506974Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5507167Z context = 2025-05-07T20:33:22.5507171Z 2025-05-07T20:33:22.5507335Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5507598Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5507708Z module_map=module_map) 2025-05-07T20:33:22.5507875Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5507981Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5508059Z E ^ 2025-05-07T20:33:22.5508413Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5508420Z 2025-05-07T20:33:22.5508833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5508838Z 2025-05-07T20:33:22.5508944Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5509169Z self=, 2025-05-07T20:33:22.5509247Z T=128, 2025-05-07T20:33:22.5509326Z D=5120, 2025-05-07T20:33:22.5509412Z scale_ub=None, 2025-05-07T20:33:22.5509497Z contiguous=True, 2025-05-07T20:33:22.5509581Z compiled=False, 2025-05-07T20:33:22.5509663Z ) 2025-05-07T20:33:22.5509967Z self = 2025-05-07T20:33:22.5510181Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:22.5510186Z 2025-05-07T20:33:22.5510268Z @given( 2025-05-07T20:33:22.5510425Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5510524Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5510643Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5510760Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5510876Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5510955Z ) 2025-05-07T20:33:22.5511197Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5511294Z def test_silu_mul_quant( 2025-05-07T20:33:22.5511371Z self, 2025-05-07T20:33:22.5511451Z T: int, 2025-05-07T20:33:22.5511532Z D: int, 2025-05-07T20:33:22.5511632Z scale_ub: Optional[float], 2025-05-07T20:33:22.5511728Z contiguous: bool, 2025-05-07T20:33:22.5511865Z compiled: bool, 2025-05-07T20:33:22.5511946Z ) -> None: 2025-05-07T20:33:22.5512041Z torch.manual_seed(2025) 2025-05-07T20:33:22.5512120Z 2025-05-07T20:33:22.5512287Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5512362Z 2025-05-07T20:33:22.5512452Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5512575Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5512665Z x = x_sign * x_clamp 2025-05-07T20:33:22.5512745Z x0 = x[:, :D] 2025-05-07T20:33:22.5512824Z x1 = x[:, D:] 2025-05-07T20:33:22.5512900Z 2025-05-07T20:33:22.5512982Z if contiguous: 2025-05-07T20:33:22.5513074Z x0 = x0.contiguous() 2025-05-07T20:33:22.5513166Z x1 = x1.contiguous() 2025-05-07T20:33:22.5513239Z 2025-05-07T20:33:22.5513329Z if scale_ub is not None: 2025-05-07T20:33:22.5513438Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5513577Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5513661Z ) 2025-05-07T20:33:22.5513736Z else: 2025-05-07T20:33:22.5513832Z scale_ub_tensor = None 2025-05-07T20:33:22.5513908Z 2025-05-07T20:33:22.5514035Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5514126Z op = silu_mul_quant 2025-05-07T20:33:22.5514212Z if compiled: 2025-05-07T20:33:22.5514308Z op = torch.compile(op) 2025-05-07T20:33:22.5514411Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5514486Z 2025-05-07T20:33:22.5514575Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5514579Z 2025-05-07T20:33:22.5514674Z moe/activation_test.py:117: 2025-05-07T20:33:22.5514806Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5514906Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5515013Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5515508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5515608Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5516044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5516264Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5516606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5516700Z kernel = self.compile( 2025-05-07T20:33:22.5517078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5517253Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5517427Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5517466Z 2025-05-07T20:33:22.5517673Z self = 2025-05-07T20:33:22.5518493Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5518994Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf0ab420>} 2025-05-07T20:33:22.5519748Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5519972Z context = 2025-05-07T20:33:22.5519987Z 2025-05-07T20:33:22.5520197Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5520460Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5520570Z module_map=module_map) 2025-05-07T20:33:22.5520733Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5520832Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5520910Z E ^ 2025-05-07T20:33:22.5521267Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5521272Z 2025-05-07T20:33:22.5521680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5521685Z 2025-05-07T20:33:22.5521789Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5522012Z self=, 2025-05-07T20:33:22.5522091Z T=128, 2025-05-07T20:33:22.5522172Z D=7168, 2025-05-07T20:33:22.5522254Z scale_ub=None, 2025-05-07T20:33:22.5522338Z contiguous=True, 2025-05-07T20:33:22.5522427Z compiled=False, 2025-05-07T20:33:22.5522499Z ) 2025-05-07T20:33:22.5522717Z self = 2025-05-07T20:33:22.5522886Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:22.5522891Z 2025-05-07T20:33:22.5522967Z @given( 2025-05-07T20:33:22.5523087Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5523184Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5523298Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5523418Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5523529Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5523602Z ) 2025-05-07T20:33:22.5523852Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5523948Z def test_silu_mul_quant( 2025-05-07T20:33:22.5524026Z self, 2025-05-07T20:33:22.5524101Z T: int, 2025-05-07T20:33:22.5524179Z D: int, 2025-05-07T20:33:22.5524279Z scale_ub: Optional[float], 2025-05-07T20:33:22.5524370Z contiguous: bool, 2025-05-07T20:33:22.5524455Z compiled: bool, 2025-05-07T20:33:22.5524535Z ) -> None: 2025-05-07T20:33:22.5524629Z torch.manual_seed(2025) 2025-05-07T20:33:22.5524702Z 2025-05-07T20:33:22.5524872Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5524945Z 2025-05-07T20:33:22.5525036Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5525163Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5525250Z x = x_sign * x_clamp 2025-05-07T20:33:22.5525333Z x0 = x[:, :D] 2025-05-07T20:33:22.5525412Z x1 = x[:, D:] 2025-05-07T20:33:22.5525593Z 2025-05-07T20:33:22.5525678Z if contiguous: 2025-05-07T20:33:22.5525771Z x0 = x0.contiguous() 2025-05-07T20:33:22.5525859Z x1 = x1.contiguous() 2025-05-07T20:33:22.5525975Z 2025-05-07T20:33:22.5526067Z if scale_ub is not None: 2025-05-07T20:33:22.5526172Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5526306Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5526381Z ) 2025-05-07T20:33:22.5526456Z else: 2025-05-07T20:33:22.5526553Z scale_ub_tensor = None 2025-05-07T20:33:22.5526625Z 2025-05-07T20:33:22.5526753Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5526845Z op = silu_mul_quant 2025-05-07T20:33:22.5526930Z if compiled: 2025-05-07T20:33:22.5527032Z op = torch.compile(op) 2025-05-07T20:33:22.5527136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5527210Z 2025-05-07T20:33:22.5527307Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5527312Z 2025-05-07T20:33:22.5527448Z moe/activation_test.py:117: 2025-05-07T20:33:22.5527578Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5527687Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5527784Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5528279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5528378Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5528733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5528954Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5529291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5529390Z kernel = self.compile( 2025-05-07T20:33:22.5529775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5529950Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5530082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5530086Z 2025-05-07T20:33:22.5530288Z self = 2025-05-07T20:33:22.5531061Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5531567Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cef8c4a0>} 2025-05-07T20:33:22.5532318Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5532513Z context = 2025-05-07T20:33:22.5532518Z 2025-05-07T20:33:22.5532681Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5532944Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5533056Z module_map=module_map) 2025-05-07T20:33:22.5533216Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5533317Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5533395Z E ^ 2025-05-07T20:33:22.5533751Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5533835Z 2025-05-07T20:33:22.5534251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5534256Z 2025-05-07T20:33:22.5534394Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5534619Z self=, 2025-05-07T20:33:22.5534696Z T=2048, 2025-05-07T20:33:22.5534771Z D=7168, 2025-05-07T20:33:22.5534856Z scale_ub=1200.0, 2025-05-07T20:33:22.5534938Z contiguous=True, 2025-05-07T20:33:22.5535020Z compiled=False, 2025-05-07T20:33:22.5535095Z ) 2025-05-07T20:33:22.5535310Z self = 2025-05-07T20:33:22.5535484Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:22.5535489Z 2025-05-07T20:33:22.5535569Z @given( 2025-05-07T20:33:22.5535686Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5535794Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5535947Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5536064Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5536183Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5536258Z ) 2025-05-07T20:33:22.5536500Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5536601Z def test_silu_mul_quant( 2025-05-07T20:33:22.5536676Z self, 2025-05-07T20:33:22.5536753Z T: int, 2025-05-07T20:33:22.5536832Z D: int, 2025-05-07T20:33:22.5536929Z scale_ub: Optional[float], 2025-05-07T20:33:22.5537017Z contiguous: bool, 2025-05-07T20:33:22.5537105Z compiled: bool, 2025-05-07T20:33:22.5537184Z ) -> None: 2025-05-07T20:33:22.5537280Z torch.manual_seed(2025) 2025-05-07T20:33:22.5537353Z 2025-05-07T20:33:22.5537519Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5539326Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
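The CompilationError above is hardware-dependent: Triton's fp8e4nv maps to the e4m3 format that NVIDIA GPUs only support natively from compute capability 8.9 (Ada/Hopper) onward, while the linux.g5.4xlarge.nvidia.gpu runner for this job carries an A10G (sm_86), where Triton offers only fp8e4b15 and fp8e5. A minimal guard sketch under that assumption; the helper name _supports_fp8e4nv is illustrative, not from the FBGEMM sources:

import unittest

import torch

def _supports_fp8e4nv() -> bool:
    # fp8e4nv (e4m3) needs an NVIDIA GPU with compute capability >= 8.9;
    # the A10G behind this job reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Usage sketch: stack on the test to skip the fp8 path on older GPUs.
# @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")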
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5539334Z 2025-05-07T20:33:22.5539451Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5539455Z 2025-05-07T20:33:22.5539559Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5539779Z self=, 2025-05-07T20:33:22.5539854Z T=1, 2025-05-07T20:33:22.5539933Z D=5120, 2025-05-07T20:33:22.5540020Z scale_ub=1200.0, 2025-05-07T20:33:22.5540104Z contiguous=True, 2025-05-07T20:33:22.5540192Z compiled=False, 2025-05-07T20:33:22.5540266Z ) 2025-05-07T20:33:22.5540483Z self = 2025-05-07T20:33:22.5540650Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:22.5540654Z 2025-05-07T20:33:22.5540731Z @given( 2025-05-07T20:33:22.5540848Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5540946Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5541058Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5541176Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5541287Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5541360Z ) 2025-05-07T20:33:22.5541603Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5541746Z def test_silu_mul_quant( 2025-05-07T20:33:22.5541862Z self, 2025-05-07T20:33:22.5541938Z T: int, 2025-05-07T20:33:22.5542018Z D: int, 2025-05-07T20:33:22.5542119Z scale_ub: Optional[float], 2025-05-07T20:33:22.5542248Z contiguous: bool, 2025-05-07T20:33:22.5542333Z compiled: bool, 2025-05-07T20:33:22.5542413Z ) -> None: 2025-05-07T20:33:22.5542506Z torch.manual_seed(2025) 2025-05-07T20:33:22.5542578Z 2025-05-07T20:33:22.5542745Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5542820Z 2025-05-07T20:33:22.5542911Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5543037Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5543124Z x = x_sign * x_clamp 2025-05-07T20:33:22.5543207Z x0 = x[:, :D] 2025-05-07T20:33:22.5543286Z x1 = x[:, D:] 2025-05-07T20:33:22.5543357Z 2025-05-07T20:33:22.5543441Z if contiguous: 2025-05-07T20:33:22.5543536Z x0 = x0.contiguous() 2025-05-07T20:33:22.5543627Z x1 = x1.contiguous() 2025-05-07T20:33:22.5543742Z 2025-05-07T20:33:22.5543833Z if scale_ub is not None: 2025-05-07T20:33:22.5543937Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5544078Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5544154Z ) 2025-05-07T20:33:22.5544230Z else: 2025-05-07T20:33:22.5547451Z scale_ub_tensor = None 2025-05-07T20:33:22.5547536Z 2025-05-07T20:33:22.5547675Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5547766Z op = silu_mul_quant 2025-05-07T20:33:22.5547857Z if compiled: 2025-05-07T20:33:22.5547956Z op = torch.compile(op) 2025-05-07T20:33:22.5548060Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5548134Z 2025-05-07T20:33:22.5548227Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5548232Z 2025-05-07T20:33:22.5548342Z moe/activation_test.py:117: 2025-05-07T20:33:22.5548475Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5548576Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5548682Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5549180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5549275Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5549658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5549905Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5550245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5550340Z kernel = self.compile( 2025-05-07T20:33:22.5550722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5550899Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5551027Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5551034Z 2025-05-07T20:33:22.5551238Z self = 2025-05-07T20:33:22.5552019Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5552520Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cef8da80>} 2025-05-07T20:33:22.5553337Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5553566Z context = 2025-05-07T20:33:22.5553609Z 2025-05-07T20:33:22.5553775Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5554036Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5554142Z module_map=module_map) 2025-05-07T20:33:22.5554306Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5554405Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5554482Z E ^ 2025-05-07T20:33:22.5554838Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5554842Z 2025-05-07T20:33:22.5555254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5555261Z 2025-05-07T20:33:22.5555410Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5555635Z self=, 2025-05-07T20:33:22.5555771Z T=2048, 2025-05-07T20:33:22.5555855Z D=5120, 2025-05-07T20:33:22.5555937Z scale_ub=None, 2025-05-07T20:33:22.5556022Z contiguous=True, 2025-05-07T20:33:22.5556106Z compiled=False, 2025-05-07T20:33:22.5556180Z ) 2025-05-07T20:33:22.5556402Z self = 2025-05-07T20:33:22.5556572Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:22.5556577Z 2025-05-07T20:33:22.5556655Z @given( 2025-05-07T20:33:22.5556775Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5556874Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5556990Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5557119Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5557234Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5557307Z ) 2025-05-07T20:33:22.5557554Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5557650Z def test_silu_mul_quant( 2025-05-07T20:33:22.5557732Z self, 2025-05-07T20:33:22.5557809Z T: int, 2025-05-07T20:33:22.5557886Z D: int, 2025-05-07T20:33:22.5557988Z scale_ub: Optional[float], 2025-05-07T20:33:22.5558076Z contiguous: bool, 2025-05-07T20:33:22.5558160Z compiled: bool, 2025-05-07T20:33:22.5558242Z ) -> None: 2025-05-07T20:33:22.5558336Z torch.manual_seed(2025) 2025-05-07T20:33:22.5558411Z 2025-05-07T20:33:22.5558582Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5558658Z 2025-05-07T20:33:22.5558749Z > x_sign = torch.sign(x) 2025-05-07T20:33:22.5560547Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5560555Z 2025-05-07T20:33:22.5560672Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:22.5560680Z 2025-05-07T20:33:22.5560782Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5561003Z self=, 2025-05-07T20:33:22.5561085Z T=16384, 2025-05-07T20:33:22.5561161Z D=5120, 2025-05-07T20:33:22.5561244Z scale_ub=None, 2025-05-07T20:33:22.5561451Z contiguous=True, 2025-05-07T20:33:22.5561537Z compiled=False, 2025-05-07T20:33:22.5561612Z ) 2025-05-07T20:33:22.5561831Z self = 2025-05-07T20:33:22.5562047Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:22.5562052Z 2025-05-07T20:33:22.5562136Z @given( 2025-05-07T20:33:22.5562253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5562350Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5562467Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5562581Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5562694Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5562771Z ) 2025-05-07T20:33:22.5563014Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5563107Z def test_silu_mul_quant( 2025-05-07T20:33:22.5563191Z self, 2025-05-07T20:33:22.5563267Z T: int, 2025-05-07T20:33:22.5563385Z D: int, 2025-05-07T20:33:22.5563488Z scale_ub: Optional[float], 2025-05-07T20:33:22.5563577Z contiguous: bool, 2025-05-07T20:33:22.5563671Z compiled: bool, 2025-05-07T20:33:22.5563750Z ) -> None: 2025-05-07T20:33:22.5563844Z torch.manual_seed(2025) 2025-05-07T20:33:22.5563920Z 2025-05-07T20:33:22.5564088Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5566238Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
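The requested sizes track the test's input shape directly: x is [T, 2 * D] in bfloat16, i.e. T * 2D * 2 bytes per allocation, and torch.sign(x) materializes a second tensor of the same size. A quick check of the two failures above under that reading:

# 16384 x (2 * 5120) bf16 elements for torch.randn -> 320 MiB, as reported.
assert 16384 * (2 * 5120) * 2 == 320 * 1024 * 1024
# 2048 x (2 * 5120) bf16 elements for the torch.sign temporary -> 40 MiB.
assert 2048 * (2 * 5120) * 2 == 40 * 1024 * 1024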
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5566255Z 2025-05-07T20:33:22.5566376Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5566383Z 2025-05-07T20:33:22.5566485Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5566714Z self=, 2025-05-07T20:33:22.5566793Z T=4096, 2025-05-07T20:33:22.5566871Z D=5120, 2025-05-07T20:33:22.5566957Z scale_ub=None, 2025-05-07T20:33:22.5567045Z contiguous=True, 2025-05-07T20:33:22.5567132Z compiled=False, 2025-05-07T20:33:22.5567205Z ) 2025-05-07T20:33:22.5567420Z self = 2025-05-07T20:33:22.5567594Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:22.5567599Z 2025-05-07T20:33:22.5567676Z @given( 2025-05-07T20:33:22.5567796Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5567901Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5568019Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5568135Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5568252Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5568330Z ) 2025-05-07T20:33:22.5568579Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5568672Z def test_silu_mul_quant( 2025-05-07T20:33:22.5568750Z self, 2025-05-07T20:33:22.5568832Z T: int, 2025-05-07T20:33:22.5568909Z D: int, 2025-05-07T20:33:22.5569009Z scale_ub: Optional[float], 2025-05-07T20:33:22.5569102Z contiguous: bool, 2025-05-07T20:33:22.5569187Z compiled: bool, 2025-05-07T20:33:22.5569266Z ) -> None: 2025-05-07T20:33:22.5569366Z torch.manual_seed(2025) 2025-05-07T20:33:22.5569439Z 2025-05-07T20:33:22.5569714Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5571552Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5571614Z 2025-05-07T20:33:22.5571731Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5571740Z 2025-05-07T20:33:22.5571840Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5572061Z self=, 2025-05-07T20:33:22.5572142Z T=2048, 2025-05-07T20:33:22.5572225Z D=5120, 2025-05-07T20:33:22.5572307Z scale_ub=None, 2025-05-07T20:33:22.5572450Z contiguous=False, 2025-05-07T20:33:22.5572535Z compiled=False, 2025-05-07T20:33:22.5572608Z ) 2025-05-07T20:33:22.5572828Z self = 2025-05-07T20:33:22.5573002Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:22.5573007Z 2025-05-07T20:33:22.5573089Z @given( 2025-05-07T20:33:22.5573206Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5573309Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5573426Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5573540Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5573653Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5573731Z ) 2025-05-07T20:33:22.5573971Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5574070Z def test_silu_mul_quant( 2025-05-07T20:33:22.5574152Z self, 2025-05-07T20:33:22.5574235Z T: int, 2025-05-07T20:33:22.5574312Z D: int, 2025-05-07T20:33:22.5574414Z scale_ub: Optional[float], 2025-05-07T20:33:22.5574502Z contiguous: bool, 2025-05-07T20:33:22.5574596Z compiled: bool, 2025-05-07T20:33:22.5574675Z ) -> None: 2025-05-07T20:33:22.5574769Z torch.manual_seed(2025) 2025-05-07T20:33:22.5574847Z 2025-05-07T20:33:22.5575012Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5576796Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5576808Z 2025-05-07T20:33:22.5576926Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5576931Z 2025-05-07T20:33:22.5577031Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5577254Z self=, 2025-05-07T20:33:22.5577331Z T=4096, 2025-05-07T20:33:22.5577408Z D=7168, 2025-05-07T20:33:22.5577493Z scale_ub=None, 2025-05-07T20:33:22.5577577Z contiguous=True, 2025-05-07T20:33:22.5577667Z compiled=True, 2025-05-07T20:33:22.5577742Z ) 2025-05-07T20:33:22.5577959Z self = 2025-05-07T20:33:22.5578128Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:22.5578133Z 2025-05-07T20:33:22.5578210Z @given( 2025-05-07T20:33:22.5578414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5578520Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5578635Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5578789Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5578905Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5578979Z ) 2025-05-07T20:33:22.5579225Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5579319Z def test_silu_mul_quant( 2025-05-07T20:33:22.5579395Z self, 2025-05-07T20:33:22.5579478Z T: int, 2025-05-07T20:33:22.5579554Z D: int, 2025-05-07T20:33:22.5579651Z scale_ub: Optional[float], 2025-05-07T20:33:22.5579744Z contiguous: bool, 2025-05-07T20:33:22.5579830Z compiled: bool, 2025-05-07T20:33:22.5579908Z ) -> None: 2025-05-07T20:33:22.5580005Z torch.manual_seed(2025) 2025-05-07T20:33:22.5580085Z 2025-05-07T20:33:22.5580290Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5582075Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5582084Z 2025-05-07T20:33:22.5582199Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5582206Z 2025-05-07T20:33:22.5582308Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5582532Z self=, 2025-05-07T20:33:22.5582617Z T=2048, 2025-05-07T20:33:22.5582695Z D=5120, 2025-05-07T20:33:22.5582780Z scale_ub=1200.0, 2025-05-07T20:33:22.5582869Z contiguous=False, 2025-05-07T20:33:22.5582954Z compiled=False, 2025-05-07T20:33:22.5583027Z ) 2025-05-07T20:33:22.5583247Z self = 2025-05-07T20:33:22.5583420Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:22.5583424Z 2025-05-07T20:33:22.5583506Z @given( 2025-05-07T20:33:22.5583625Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5583723Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5583840Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5583955Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5584068Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5584145Z ) 2025-05-07T20:33:22.5584392Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5584491Z def test_silu_mul_quant( 2025-05-07T20:33:22.5584573Z self, 2025-05-07T20:33:22.5584650Z T: int, 2025-05-07T20:33:22.5584729Z D: int, 2025-05-07T20:33:22.5584828Z scale_ub: Optional[float], 2025-05-07T20:33:22.5584916Z contiguous: bool, 2025-05-07T20:33:22.5585005Z compiled: bool, 2025-05-07T20:33:22.5585085Z ) -> None: 2025-05-07T20:33:22.5585181Z torch.manual_seed(2025) 2025-05-07T20:33:22.5585259Z 2025-05-07T20:33:22.5585425Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5587248Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5587328Z 2025-05-07T20:33:22.5587445Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5587449Z 2025-05-07T20:33:22.5587553Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5587775Z self=, 2025-05-07T20:33:22.5587853Z T=4096, 2025-05-07T20:33:22.5587931Z D=7168, 2025-05-07T20:33:22.5588020Z scale_ub=1200.0, 2025-05-07T20:33:22.5588104Z contiguous=True, 2025-05-07T20:33:22.5588191Z compiled=False, 2025-05-07T20:33:22.5588263Z ) 2025-05-07T20:33:22.5588478Z self = 2025-05-07T20:33:22.5588655Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:22.5588662Z 2025-05-07T20:33:22.5588742Z @given( 2025-05-07T20:33:22.5588924Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5589024Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5589147Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5589262Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5589375Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5589454Z ) 2025-05-07T20:33:22.5589696Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5589792Z def test_silu_mul_quant( 2025-05-07T20:33:22.5589868Z self, 2025-05-07T20:33:22.5589945Z T: int, 2025-05-07T20:33:22.5590024Z D: int, 2025-05-07T20:33:22.5590121Z scale_ub: Optional[float], 2025-05-07T20:33:22.5590209Z contiguous: bool, 2025-05-07T20:33:22.5590298Z compiled: bool, 2025-05-07T20:33:22.5590379Z ) -> None: 2025-05-07T20:33:22.5590475Z torch.manual_seed(2025) 2025-05-07T20:33:22.5590553Z 2025-05-07T20:33:22.5590720Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5592507Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5592513Z 2025-05-07T20:33:22.5592628Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5592632Z 2025-05-07T20:33:22.5592735Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5592961Z self=, 2025-05-07T20:33:22.5593042Z T=16384, 2025-05-07T20:33:22.5593124Z D=7168, 2025-05-07T20:33:22.5593205Z scale_ub=None, 2025-05-07T20:33:22.5593292Z contiguous=False, 2025-05-07T20:33:22.5593376Z compiled=True, 2025-05-07T20:33:22.5593448Z ) 2025-05-07T20:33:22.5593662Z self = 2025-05-07T20:33:22.5593839Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:22.5593843Z 2025-05-07T20:33:22.5593921Z @given( 2025-05-07T20:33:22.5594038Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5594138Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5594251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5594370Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5594482Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5594638Z ) 2025-05-07T20:33:22.5594886Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5594978Z def test_silu_mul_quant( 2025-05-07T20:33:22.5595099Z self, 2025-05-07T20:33:22.5595178Z T: int, 2025-05-07T20:33:22.5595256Z D: int, 2025-05-07T20:33:22.5595352Z scale_ub: Optional[float], 2025-05-07T20:33:22.5595443Z contiguous: bool, 2025-05-07T20:33:22.5595527Z compiled: bool, 2025-05-07T20:33:22.5595608Z ) -> None: 2025-05-07T20:33:22.5595701Z torch.manual_seed(2025) 2025-05-07T20:33:22.5595827Z 2025-05-07T20:33:22.5595998Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5597825Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
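Each message suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, but that hint only addresses fragmentation (reserved-but-unallocated memory, which is tens of MiB here), so it is unlikely to recover a process already holding ~22 GiB. For completeness, a sketch of applying the hint; it must take effect before the first CUDA allocation:

import os

# Set before torch touches the GPU; ideally before importing torch at all.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402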
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5597836Z 2025-05-07T20:33:22.5597954Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5597959Z 2025-05-07T20:33:22.5598058Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5598278Z self=, 2025-05-07T20:33:22.5598356Z T=4096, 2025-05-07T20:33:22.5598432Z D=7168, 2025-05-07T20:33:22.5598514Z scale_ub=None, 2025-05-07T20:33:22.5598600Z contiguous=True, 2025-05-07T20:33:22.5598683Z compiled=False, 2025-05-07T20:33:22.5598757Z ) 2025-05-07T20:33:22.5598975Z self = 2025-05-07T20:33:22.5599149Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:22.5599155Z 2025-05-07T20:33:22.5599235Z @given( 2025-05-07T20:33:22.5599352Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5599453Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5599570Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5599684Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5599795Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5599870Z ) 2025-05-07T20:33:22.5600111Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5600208Z def test_silu_mul_quant( 2025-05-07T20:33:22.5600285Z self, 2025-05-07T20:33:22.5600361Z T: int, 2025-05-07T20:33:22.5600440Z D: int, 2025-05-07T20:33:22.5600536Z scale_ub: Optional[float], 2025-05-07T20:33:22.5600625Z contiguous: bool, 2025-05-07T20:33:22.5600718Z compiled: bool, 2025-05-07T20:33:22.5600795Z ) -> None: 2025-05-07T20:33:22.5600891Z torch.manual_seed(2025) 2025-05-07T20:33:22.5600968Z 2025-05-07T20:33:22.5601132Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5602917Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5604812Z 2025-05-07T20:33:22.5604931Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5605190Z 2025-05-07T20:33:22.5605331Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5605749Z self=, 2025-05-07T20:33:22.5606155Z T=16384, 2025-05-07T20:33:22.5606387Z D=7168, 2025-05-07T20:33:22.5606582Z scale_ub=None, 2025-05-07T20:33:22.5606800Z contiguous=True, 2025-05-07T20:33:22.5607022Z compiled=False, 2025-05-07T20:33:22.5607222Z ) 2025-05-07T20:33:22.5607542Z self = 2025-05-07T20:33:22.5608039Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:22.5608315Z 2025-05-07T20:33:22.5608407Z @given( 2025-05-07T20:33:22.5608643Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5608957Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5609264Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5609591Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5609923Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5610255Z ) 2025-05-07T20:33:22.5610603Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5611043Z def test_silu_mul_quant( 2025-05-07T20:33:22.5611285Z self, 2025-05-07T20:33:22.5611479Z T: int, 2025-05-07T20:33:22.5611670Z D: int, 2025-05-07T20:33:22.5611885Z scale_ub: Optional[float], 2025-05-07T20:33:22.5612155Z contiguous: bool, 2025-05-07T20:33:22.5612387Z compiled: bool, 2025-05-07T20:33:22.5612608Z ) -> None: 2025-05-07T20:33:22.5612819Z torch.manual_seed(2025) 2025-05-07T20:33:22.5613054Z 2025-05-07T20:33:22.5613326Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5615381Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5617253Z 2025-05-07T20:33:22.5617370Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5617581Z 2025-05-07T20:33:22.5617686Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5618094Z self=, 2025-05-07T20:33:22.5618497Z T=16384, 2025-05-07T20:33:22.5618689Z D=7168, 2025-05-07T20:33:22.5618877Z scale_ub=1200.0, 2025-05-07T20:33:22.5619098Z contiguous=True, 2025-05-07T20:33:22.5619317Z compiled=False, 2025-05-07T20:33:22.5619520Z ) 2025-05-07T20:33:22.5619838Z self = 2025-05-07T20:33:22.5620336Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:22.5620614Z 2025-05-07T20:33:22.5620702Z @given( 2025-05-07T20:33:22.5620924Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5621239Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5621544Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5621870Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5622200Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5622485Z ) 2025-05-07T20:33:22.5622838Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5623274Z def test_silu_mul_quant( 2025-05-07T20:33:22.5623515Z self, 2025-05-07T20:33:22.5623707Z T: int, 2025-05-07T20:33:22.5623898Z D: int, 2025-05-07T20:33:22.5624171Z scale_ub: Optional[float], 2025-05-07T20:33:22.5624477Z contiguous: bool, 2025-05-07T20:33:22.5624714Z compiled: bool, 2025-05-07T20:33:22.5624936Z ) -> None: 2025-05-07T20:33:22.5625149Z torch.manual_seed(2025) 2025-05-07T20:33:22.5625426Z 2025-05-07T20:33:22.5625695Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5627748Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
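The failing requests are small (40-448 MiB) against the 21.7 GiB PyTorch already holds, so the likelier culprit is memory accumulating across Hypothesis examples rather than any single oversized case. A cleanup sketch that could run between examples; the helper name is illustrative, not part of the test file:

import gc

import torch

def _release_cuda_memory() -> None:
    # Drop dead Python references first, then return the allocator's
    # cached blocks so the next example starts from an empty pool.
    gc.collect()
    torch.cuda.empty_cache()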
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5629614Z 2025-05-07T20:33:22.5629737Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5629947Z 2025-05-07T20:33:22.5630093Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5630502Z self=, 2025-05-07T20:33:22.5630908Z T=128, 2025-05-07T20:33:22.5631096Z D=5120, 2025-05-07T20:33:22.5631283Z scale_ub=1200.0, 2025-05-07T20:33:22.5631504Z contiguous=False, 2025-05-07T20:33:22.5631728Z compiled=False, 2025-05-07T20:33:22.5631928Z ) 2025-05-07T20:33:22.5632244Z self = 2025-05-07T20:33:22.5632735Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:22.5633008Z 2025-05-07T20:33:22.5633091Z @given( 2025-05-07T20:33:22.5633313Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5633625Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5633932Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5634258Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5634588Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5634871Z ) 2025-05-07T20:33:22.5635215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5635654Z def test_silu_mul_quant( 2025-05-07T20:33:22.5635943Z self, 2025-05-07T20:33:22.5636136Z T: int, 2025-05-07T20:33:22.5636333Z D: int, 2025-05-07T20:33:22.5636549Z scale_ub: Optional[float], 2025-05-07T20:33:22.5636814Z contiguous: bool, 2025-05-07T20:33:22.5637052Z compiled: bool, 2025-05-07T20:33:22.5637273Z ) -> None: 2025-05-07T20:33:22.5637485Z torch.manual_seed(2025) 2025-05-07T20:33:22.5637723Z 2025-05-07T20:33:22.5637993Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5638334Z 2025-05-07T20:33:22.5638523Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5638819Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5639135Z x = x_sign * x_clamp 2025-05-07T20:33:22.5639373Z x0 = x[:, :D] 2025-05-07T20:33:22.5639589Z x1 = x[:, D:] 2025-05-07T20:33:22.5639795Z 2025-05-07T20:33:22.5639975Z if contiguous: 2025-05-07T20:33:22.5640209Z x0 = x0.contiguous() 2025-05-07T20:33:22.5640464Z x1 = x1.contiguous() 2025-05-07T20:33:22.5640701Z 2025-05-07T20:33:22.5640895Z if scale_ub is not None: 2025-05-07T20:33:22.5641169Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5641497Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5641810Z ) 2025-05-07T20:33:22.5642000Z else: 2025-05-07T20:33:22.5642207Z scale_ub_tensor = None 2025-05-07T20:33:22.5642456Z 2025-05-07T20:33:22.5642685Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5643050Z op = silu_mul_quant 2025-05-07T20:33:22.5643334Z if compiled: 2025-05-07T20:33:22.5643581Z op = torch.compile(op) 2025-05-07T20:33:22.5643877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5644213Z 2025-05-07T20:33:22.5644410Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5644573Z 2025-05-07T20:33:22.5644674Z moe/activation_test.py:117: 2025-05-07T20:33:22.5644964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5645298Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5645578Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5646265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5646954Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5647490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5648179Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5648888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5649426Z kernel = self.compile( 2025-05-07T20:33:22.5649963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5650616Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5651008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5651244Z 2025-05-07T20:33:22.5651451Z self = 2025-05-07T20:33:22.5652538Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5653914Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32ced287c0>} 2025-05-07T20:33:22.5655253Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5656277Z context = 2025-05-07T20:33:22.5656567Z 2025-05-07T20:33:22.5656733Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5657256Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5657721Z module_map=module_map) 2025-05-07T20:33:22.5658090Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5658451Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5658710Z E ^ 2025-05-07T20:33:22.5659179Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5659661Z 2025-05-07T20:33:22.5660099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5660609Z 2025-05-07T20:33:22.5660717Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5661127Z self=, 2025-05-07T20:33:22.5661530Z T=2048, 2025-05-07T20:33:22.5661718Z D=7168, 2025-05-07T20:33:22.5661908Z scale_ub=None, 2025-05-07T20:33:22.5662118Z contiguous=False, 2025-05-07T20:33:22.5662342Z compiled=False, 2025-05-07T20:33:22.5662543Z ) 2025-05-07T20:33:22.5662856Z self = 2025-05-07T20:33:22.5663408Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:22.5663720Z 2025-05-07T20:33:22.5663805Z @given( 2025-05-07T20:33:22.5664032Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5664387Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5664694Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5665020Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5665614Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5665948Z ) 2025-05-07T20:33:22.5666299Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5666737Z def test_silu_mul_quant( 2025-05-07T20:33:22.5666978Z self, 2025-05-07T20:33:22.5667170Z T: int, 2025-05-07T20:33:22.5667363Z D: int, 2025-05-07T20:33:22.5667579Z scale_ub: Optional[float], 2025-05-07T20:33:22.5667847Z contiguous: bool, 2025-05-07T20:33:22.5668080Z compiled: bool, 2025-05-07T20:33:22.5668308Z ) -> None: 2025-05-07T20:33:22.5668603Z torch.manual_seed(2025) 2025-05-07T20:33:22.5668841Z 2025-05-07T20:33:22.5669111Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5671216Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
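With derandomize=True and print_blob=True active (the rerun's session header later in this log reports both), every printed example is reproducible; one way to pin the case that just failed while debugging is Hypothesis's stock example decorator, stacked above the existing @given, sketched here:

from hypothesis import example

# Pins the T=2048, D=7168 OOM case printed above so it always runs;
# apply directly above @given on test_silu_mul_quant.
pin_failing_case = example(
    T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False
)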
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5673079Z 2025-05-07T20:33:22.5673198Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5673410Z 2025-05-07T20:33:22.5673520Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5673932Z self=, 2025-05-07T20:33:22.5674333Z T=128, 2025-05-07T20:33:22.5674522Z D=7168, 2025-05-07T20:33:22.5674712Z scale_ub=1200.0, 2025-05-07T20:33:22.5674934Z contiguous=True, 2025-05-07T20:33:22.5675152Z compiled=True, 2025-05-07T20:33:22.5675350Z ) 2025-05-07T20:33:22.5675666Z self = 2025-05-07T20:33:22.5676200Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:22.5676468Z 2025-05-07T20:33:22.5676550Z @given( 2025-05-07T20:33:22.5676778Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5677086Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5677389Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5677711Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5678039Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5678329Z ) 2025-05-07T20:33:22.5678676Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5679121Z def test_silu_mul_quant( 2025-05-07T20:33:22.5679361Z self, 2025-05-07T20:33:22.5679548Z T: int, 2025-05-07T20:33:22.5679745Z D: int, 2025-05-07T20:33:22.5679964Z scale_ub: Optional[float], 2025-05-07T20:33:22.5680227Z contiguous: bool, 2025-05-07T20:33:22.5680463Z compiled: bool, 2025-05-07T20:33:22.5680681Z ) -> None: 2025-05-07T20:33:22.5680894Z torch.manual_seed(2025) 2025-05-07T20:33:22.5681126Z 2025-05-07T20:33:22.5681396Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5686249Z 2025-05-07T20:33:22.5686479Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5686773Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5687091Z x = x_sign * x_clamp 2025-05-07T20:33:22.5687503Z x0 = x[:, :D] 2025-05-07T20:33:22.5687725Z x1 = x[:, D:] 2025-05-07T20:33:22.5687932Z 2025-05-07T20:33:22.5688120Z if contiguous: 2025-05-07T20:33:22.5688411Z x0 = x0.contiguous() 2025-05-07T20:33:22.5688673Z x1 = x1.contiguous() 2025-05-07T20:33:22.5688917Z 2025-05-07T20:33:22.5689116Z if scale_ub is not None: 2025-05-07T20:33:22.5689384Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5689719Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5690031Z ) 2025-05-07T20:33:22.5690219Z else: 2025-05-07T20:33:22.5690435Z scale_ub_tensor = None 2025-05-07T20:33:22.5690687Z 2025-05-07T20:33:22.5690921Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5691237Z op = silu_mul_quant 2025-05-07T20:33:22.5691485Z if compiled: 2025-05-07T20:33:22.5691727Z op = torch.compile(op) 2025-05-07T20:33:22.5692029Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5692353Z 2025-05-07T20:33:22.5692549Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5692718Z 2025-05-07T20:33:22.5692822Z moe/activation_test.py:117: 2025-05-07T20:33:22.5693120Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5693454Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5693732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5694295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5694858Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5695514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5696195Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5696732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5697416Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5698076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5698610Z kernel = self.compile( 2025-05-07T20:33:22.5699148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5699825Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5700250Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5700486Z 2025-05-07T20:33:22.5700694Z self = 2025-05-07T20:33:22.5701782Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5703158Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32ced29940>} 2025-05-07T20:33:22.5704495Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5705523Z context = 2025-05-07T20:33:22.5705814Z 2025-05-07T20:33:22.5705979Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5706502Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5706963Z module_map=module_map) 2025-05-07T20:33:22.5707381Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5707774Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5708030Z E ^ 2025-05-07T20:33:22.5708496Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5708992Z 2025-05-07T20:33:22.5709406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5709953Z 2025-05-07T20:33:22.5710070Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5710479Z self=, 2025-05-07T20:33:22.5710882Z T=128, 2025-05-07T20:33:22.5711073Z D=7168, 2025-05-07T20:33:22.5711265Z scale_ub=1200.0, 2025-05-07T20:33:22.5711492Z contiguous=True, 2025-05-07T20:33:22.5711716Z compiled=False, 2025-05-07T20:33:22.5711917Z ) 2025-05-07T20:33:22.5712240Z self = 2025-05-07T20:33:22.5712786Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:22.5713057Z 2025-05-07T20:33:22.5713137Z @given( 2025-05-07T20:33:22.5713366Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5713683Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5713989Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5714318Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5714643Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5714928Z ) 2025-05-07T20:33:22.5715269Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5715782Z def test_silu_mul_quant( 2025-05-07T20:33:22.5716024Z self, 2025-05-07T20:33:22.5716212Z T: int, 2025-05-07T20:33:22.5716412Z D: int, 2025-05-07T20:33:22.5716627Z scale_ub: Optional[float], 2025-05-07T20:33:22.5716905Z contiguous: bool, 2025-05-07T20:33:22.5717139Z compiled: bool, 2025-05-07T20:33:22.5717362Z ) -> None: 2025-05-07T20:33:22.5717572Z torch.manual_seed(2025) 2025-05-07T20:33:22.5717806Z 2025-05-07T20:33:22.5718079Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5718421Z 2025-05-07T20:33:22.5718609Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5718904Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5720916Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5722776Z 2025-05-07T20:33:22.5722897Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:22.5723108Z 2025-05-07T20:33:22.5723217Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5723621Z self=, 2025-05-07T20:33:22.5724026Z T=128, 2025-05-07T20:33:22.5724214Z D=5120, 2025-05-07T20:33:22.5724404Z scale_ub=1200.0, 2025-05-07T20:33:22.5724628Z contiguous=True, 2025-05-07T20:33:22.5724845Z compiled=True, 2025-05-07T20:33:22.5725041Z ) 2025-05-07T20:33:22.5725358Z self = 2025-05-07T20:33:22.5725844Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:22.5726112Z 2025-05-07T20:33:22.5726190Z @given( 2025-05-07T20:33:22.5726417Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5726821Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5727130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5727452Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5727818Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5728104Z ) 2025-05-07T20:33:22.5728446Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5728891Z def test_silu_mul_quant( 2025-05-07T20:33:22.5729136Z self, 2025-05-07T20:33:22.5729322Z T: int, 2025-05-07T20:33:22.5729518Z D: int, 2025-05-07T20:33:22.5729739Z scale_ub: Optional[float], 2025-05-07T20:33:22.5730006Z contiguous: bool, 2025-05-07T20:33:22.5730243Z compiled: bool, 2025-05-07T20:33:22.5730464Z ) -> None: 2025-05-07T20:33:22.5730674Z torch.manual_seed(2025) 2025-05-07T20:33:22.5730917Z 2025-05-07T20:33:22.5731189Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5731530Z 2025-05-07T20:33:22.5731765Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5732052Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5734057Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
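Note how the failure point creeps forward as free memory shrinks: earlier examples die in torch.randn (activation_test.py:92), then torch.sign (:94), and now torch.clamp (:95). The two figures each message quotes come straight from the caching allocator and can be polled between examples to confirm the growth; a small sketch using standard torch.cuda calls:

import torch

def report_cuda_memory(tag: str) -> None:
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    # The OOM text's "reserved but unallocated" is reserved - allocated.
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")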
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5735913Z 2025-05-07T20:33:22.5736030Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:22.5736241Z 2025-05-07T20:33:22.5736350Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5736764Z self=, 2025-05-07T20:33:22.5737166Z T=128, 2025-05-07T20:33:22.5737355Z D=7168, 2025-05-07T20:33:22.5737544Z scale_ub=None, 2025-05-07T20:33:22.5737760Z contiguous=True, 2025-05-07T20:33:22.5737978Z compiled=True, 2025-05-07T20:33:22.5738172Z ) 2025-05-07T20:33:22.5738486Z self = 2025-05-07T20:33:22.5738969Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:22.5739231Z 2025-05-07T20:33:22.5739310Z @given( 2025-05-07T20:33:22.5739535Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5739875Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5740200Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5740525Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5740848Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5741138Z ) 2025-05-07T20:33:22.5741483Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5741920Z def test_silu_mul_quant( 2025-05-07T20:33:22.5742162Z self, 2025-05-07T20:33:22.5742350Z T: int, 2025-05-07T20:33:22.5742542Z D: int, 2025-05-07T20:33:22.5742755Z scale_ub: Optional[float], 2025-05-07T20:33:22.5743020Z contiguous: bool, 2025-05-07T20:33:22.5743255Z compiled: bool, 2025-05-07T20:33:22.5743474Z ) -> None: 2025-05-07T20:33:22.5743687Z torch.manual_seed(2025) 2025-05-07T20:33:22.5743921Z 2025-05-07T20:33:22.5744189Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5746280Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5748229Z 2025-05-07T20:33:22.5748351Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5748689Z =============================== warnings summary =============================== 2025-05-07T20:33:22.5749233Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:22.5749953Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:22.5750666Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:22.5751974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:22.5753171Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:22.5753501Z 2025-05-07T20:33:22.5753708Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:22.5754177Z ================= 1 failed, 1 deselected, 3 warnings in 13.14s ================= 2025-05-07T20:33:24.1453583Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:24.2071552Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:33:24.2071781Z 2025-05-07T20:33:26.2089524Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:28.3661002Z ============================= test session starts ============================== 2025-05-07T20:33:28.3661666Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:28.3662190Z cachedir: .pytest_cache 2025-05-07T20:33:28.3662882Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:28.3664339Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:28.3665140Z plugins: hypothesis-6.131.14 2025-05-07T20:33:29.9804655Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:30.0883969Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:30.0884387Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:30.0884623Z 2025-05-07T20:33:32.4543815Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.4545300Z self=, 2025-05-07T20:33:32.4545731Z T=1, 2025-05-07T20:33:32.4545918Z D=5120, 2025-05-07T20:33:32.4546117Z scale_ub=None, 2025-05-07T20:33:32.4546333Z contiguous=True, 2025-05-07T20:33:32.4546552Z compiled=True, 2025-05-07T20:33:32.4546765Z ) 2025-05-07T20:33:32.4547093Z self = 2025-05-07T20:33:32.4547584Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:32.4547846Z 2025-05-07T20:33:32.4547926Z @given( 2025-05-07T20:33:32.4548161Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.4548478Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.4548780Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.4549557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.4549898Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.4550183Z ) 2025-05-07T20:33:32.4550656Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.4551111Z def test_silu_mul_quant( 2025-05-07T20:33:32.4551359Z self, 2025-05-07T20:33:32.4551557Z T: int, 2025-05-07T20:33:32.4551761Z D: int, 2025-05-07T20:33:32.4551981Z scale_ub: Optional[float], 2025-05-07T20:33:32.4552252Z contiguous: bool, 2025-05-07T20:33:32.4552501Z compiled: bool, 2025-05-07T20:33:32.4552736Z ) -> None: 2025-05-07T20:33:32.4552954Z torch.manual_seed(2025) 2025-05-07T20:33:32.4553202Z 2025-05-07T20:33:32.4553480Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.4553828Z 2025-05-07T20:33:32.4554029Z x_sign = torch.sign(x) 2025-05-07T20:33:32.4554331Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:32.4554740Z x = x_sign * x_clamp 2025-05-07T20:33:32.4554996Z x0 = x[:, :D] 2025-05-07T20:33:32.4555220Z x1 = x[:, D:] 2025-05-07T20:33:32.4555431Z 2025-05-07T20:33:32.4555626Z if contiguous: 2025-05-07T20:33:32.4555993Z x0 = x0.contiguous() 2025-05-07T20:33:32.4556253Z x1 = x1.contiguous() 2025-05-07T20:33:32.4556500Z 2025-05-07T20:33:32.4556703Z if scale_ub is not None: 2025-05-07T20:33:32.4556982Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.4557320Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.4557640Z ) 2025-05-07T20:33:32.4557839Z else: 2025-05-07T20:33:32.4558051Z scale_ub_tensor = None 2025-05-07T20:33:32.4558310Z 2025-05-07T20:33:32.4558553Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.4558870Z op = silu_mul_quant 2025-05-07T20:33:32.4559131Z if compiled: 2025-05-07T20:33:32.4559387Z op = torch.compile(op) 2025-05-07T20:33:32.4559684Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.4559966Z 2025-05-07T20:33:32.4560164Z y_fp8, y_scale = fn() 2025-05-07T20:33:32.4560449Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:32.4560746Z 2025-05-07T20:33:32.4560992Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.4561326Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:32.4561623Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:32.4561942Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:32.4562309Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:32.4562621Z 2025-05-07T20:33:32.4562829Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:32.4563025Z 2025-05-07T20:33:32.4563133Z moe/activation_test.py:126: 2025-05-07T20:33:32.4563437Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.4563781Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:32.4564116Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:32.4564915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:32.4565967Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:32.4566518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.4567206Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.4567893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:32.4568621Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:32.4569546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:32.4570191Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:32.4570850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:32.4571373Z fn() 2025-05-07T20:33:32.4571887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:32.4572473Z self.fn.run( 2025-05-07T20:33:32.4572940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.4573478Z kernel = self.compile( 2025-05-07T20:33:32.4574022Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:32.4574681Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:32.4575151Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:32.4575386Z
2025-05-07T20:33:32.4575606Z self =
2025-05-07T20:33:32.4576701Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:32.4578096Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2cb2735c60>}
2025-05-07T20:33:32.4579445Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:32.4580485Z context =
2025-05-07T20:33:32.4580777Z
2025-05-07T20:33:32.4580961Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:32.4581488Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:32.4581968Z module_map=module_map)
2025-05-07T20:33:32.4582338Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:32.4582703Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:32.4582971Z E ^
2025-05-07T20:33:32.4583442Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:32.4583896Z
2025-05-07T20:33:32.4584318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

[Every remaining Hypothesis trial of test_silu_mul_quant fails with this same CompilationError. The per-trial test source and Triton traceback repeat verbatim; only the drawn parameters and the first fp8 kernel reached differ, so the repeats are condensed into the list after the sketch below. The pattern is consistent: trials with compiled=False fail inside fn(), when the eager _fbgemm_silu_mul_quant Triton kernel is compiled, while trials with compiled=True get through fn() (the torch.compile path evidently does not hit the eager Triton compile) and fail later in ref_fn(), when triton_quantize_fp8_row compiles _kernel_quantize_fp8_row.]
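[Why the architecture complaint: fp8e4nv is Triton's name for the float8 E4M3 format. The snippet below is a hedged sanity check, not part of the test suite; the (8, 9) threshold is an assumption that Triton's CUDA backend only enables fp8e4nv from compute capability 8.9 (Ada/Hopper) upward, which would exclude the A10G (sm_86) backing this linux.g5.4xlarge runner. The error text itself only guarantees that this GPU gets 'fp8e4b15' and 'fp8e5'.]

    import torch

    # Sketch: report what the device offers. The (8, 9) cutoff is an
    # assumption about when Triton's CUDA backend enables fp8e4nv (E4M3).
    major, minor = torch.cuda.get_device_capability()
    print(torch.cuda.get_device_name(0))              # expected here: "NVIDIA A10G"
    print(f"compute capability: sm_{major}{minor}")   # expected here: sm_86
    print("fp8e4nv (E4M3) expected:", (major, minor) >= (8, 9))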
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.7283352Z 2025-05-07T20:33:34.7283766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.7284282Z 2025-05-07T20:33:34.7284388Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.7284807Z self=, 2025-05-07T20:33:34.7285226Z T=4096, 2025-05-07T20:33:34.7285430Z D=7168, 2025-05-07T20:33:34.7291499Z scale_ub=None, 2025-05-07T20:33:34.7291742Z contiguous=False, 2025-05-07T20:33:34.7291986Z compiled=False, 2025-05-07T20:33:34.7292203Z ) 2025-05-07T20:33:34.7292535Z self = 2025-05-07T20:33:34.7293068Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:34.7293351Z 2025-05-07T20:33:34.7293435Z @given( 2025-05-07T20:33:34.7293682Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.7294004Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.7294314Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.7294654Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.7294992Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.7295281Z ) 2025-05-07T20:33:34.7295644Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.7296095Z def test_silu_mul_quant( 2025-05-07T20:33:34.7296343Z self, 2025-05-07T20:33:34.7296543Z T: int, 2025-05-07T20:33:34.7296754Z D: int, 2025-05-07T20:33:34.7296975Z scale_ub: Optional[float], 2025-05-07T20:33:34.7297257Z contiguous: bool, 2025-05-07T20:33:34.7297509Z compiled: bool, 2025-05-07T20:33:34.7297736Z ) -> None: 2025-05-07T20:33:34.7297962Z torch.manual_seed(2025) 2025-05-07T20:33:34.7298214Z 2025-05-07T20:33:34.7298488Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.7298842Z 2025-05-07T20:33:34.7299045Z x_sign = torch.sign(x) 2025-05-07T20:33:34.7299333Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.7299648Z x = x_sign * x_clamp 2025-05-07T20:33:34.7299900Z x0 = x[:, :D] 2025-05-07T20:33:34.7300119Z x1 = x[:, D:] 2025-05-07T20:33:34.7300337Z 2025-05-07T20:33:34.7300534Z if contiguous: 2025-05-07T20:33:34.7300767Z x0 = x0.contiguous() 2025-05-07T20:33:34.7301038Z x1 = x1.contiguous() 2025-05-07T20:33:34.7301406Z 2025-05-07T20:33:34.7301610Z if scale_ub is not None: 2025-05-07T20:33:34.7301896Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.7302239Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.7302600Z ) 2025-05-07T20:33:34.7302797Z else: 2025-05-07T20:33:34.7303019Z scale_ub_tensor = None 2025-05-07T20:33:34.7303278Z 2025-05-07T20:33:34.7303511Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.7303838Z op = silu_mul_quant 2025-05-07T20:33:34.7304098Z if compiled: 2025-05-07T20:33:34.7304345Z op = torch.compile(op) 2025-05-07T20:33:34.7304648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.7304931Z 2025-05-07T20:33:34.7305126Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.7305300Z 2025-05-07T20:33:34.7305407Z moe/activation_test.py:117: 2025-05-07T20:33:34.7305721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.7306110Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.7306391Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.7307085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.7307785Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.7308320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.7309006Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.7309678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.7310217Z kernel = self.compile( 2025-05-07T20:33:34.7310761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.7311425Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.7311831Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.7312067Z 2025-05-07T20:33:34.7312284Z self = 2025-05-07T20:33:34.7313378Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.7314766Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2cb0aee0c0>} 2025-05-07T20:33:34.7316195Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.7317243Z context = 2025-05-07T20:33:34.7317531Z 2025-05-07T20:33:34.7317701Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.7318231Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.7318720Z module_map=module_map) 2025-05-07T20:33:34.7319094Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.7319458Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.7319734Z E ^ 2025-05-07T20:33:34.7320205Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.7320662Z 2025-05-07T20:33:34.7321077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.7321597Z 2025-05-07T20:33:34.7321794Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.7322220Z self=, 2025-05-07T20:33:34.7322639Z T=128, 2025-05-07T20:33:34.7322872Z D=7168, 2025-05-07T20:33:34.7323077Z scale_ub=None, 2025-05-07T20:33:34.7323305Z contiguous=False, 2025-05-07T20:33:34.7323533Z compiled=True, 2025-05-07T20:33:34.7323746Z ) 2025-05-07T20:33:34.7872006Z self = 2025-05-07T20:33:34.7873066Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:34.7873610Z 2025-05-07T20:33:34.7873767Z @given( 2025-05-07T20:33:34.7874234Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.7874869Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.7875491Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.7876111Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.7876464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.7876764Z ) 2025-05-07T20:33:34.7877218Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.7877678Z def test_silu_mul_quant( 2025-05-07T20:33:34.7877938Z self, 2025-05-07T20:33:34.7878141Z T: int, 2025-05-07T20:33:34.7878356Z D: int, 2025-05-07T20:33:34.7878585Z scale_ub: Optional[float], 2025-05-07T20:33:34.7878865Z contiguous: bool, 2025-05-07T20:33:34.7879120Z compiled: bool, 2025-05-07T20:33:34.7879357Z ) -> None: 2025-05-07T20:33:34.7879580Z torch.manual_seed(2025) 2025-05-07T20:33:34.7879836Z 2025-05-07T20:33:34.7880115Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.7880476Z 2025-05-07T20:33:34.7880679Z x_sign = torch.sign(x) 2025-05-07T20:33:34.7880988Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.7881315Z x = x_sign * x_clamp 2025-05-07T20:33:34.7881560Z x0 = x[:, :D] 2025-05-07T20:33:34.7881795Z x1 = x[:, D:] 2025-05-07T20:33:34.7882019Z 2025-05-07T20:33:34.7882206Z if contiguous: 2025-05-07T20:33:34.7882454Z x0 = x0.contiguous() 2025-05-07T20:33:34.7882719Z x1 = x1.contiguous() 2025-05-07T20:33:34.7882964Z 2025-05-07T20:33:34.7883159Z if scale_ub is not None: 2025-05-07T20:33:34.7883436Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.7883770Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.7884082Z ) 2025-05-07T20:33:34.7884281Z else: 2025-05-07T20:33:34.7884494Z scale_ub_tensor = None 2025-05-07T20:33:34.7884748Z 2025-05-07T20:33:34.7884982Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.7885303Z op = silu_mul_quant 2025-05-07T20:33:34.7885553Z if compiled: 2025-05-07T20:33:34.7885812Z op = torch.compile(op) 2025-05-07T20:33:34.7886117Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.7886396Z 2025-05-07T20:33:34.7886597Z y_fp8, y_scale = fn() 2025-05-07T20:33:34.7886891Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:34.7887185Z 2025-05-07T20:33:34.7887424Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.7887768Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:34.7888070Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:34.7888387Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:34.7888742Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:34.7889055Z 2025-05-07T20:33:34.7889266Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:34.7889460Z 2025-05-07T20:33:34.7889564Z moe/activation_test.py:126: 2025-05-07T20:33:34.7889860Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.7890330Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:34.7890663Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:34.7891446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:34.7892256Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:34.7892809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.7893492Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.7894178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:34.7894907Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:34.7895651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:34.7896375Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:34.7896983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:34.7897506Z fn() 2025-05-07T20:33:34.7898016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:34.7898595Z self.fn.run( 2025-05-07T20:33:34.7899071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.7899605Z kernel = self.compile( 2025-05-07T20:33:34.7900140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.7900793Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.7901201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.7901435Z 2025-05-07T20:33:34.7901651Z self = 2025-05-07T20:33:34.7902734Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.7904110Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2cb0aada80>} 2025-05-07T20:33:34.7905452Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.7906531Z context = 2025-05-07T20:33:34.7906819Z 2025-05-07T20:33:34.7907001Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.7907525Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.7908005Z module_map=module_map) 2025-05-07T20:33:34.7908376Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.7908738Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:34.7909019Z E ^ 2025-05-07T20:33:34.7909491Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.7909943Z 2025-05-07T20:33:34.7910368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.7910875Z 2025-05-07T20:33:34.7910983Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.7911402Z self=, 2025-05-07T20:33:34.7911897Z T=128, 2025-05-07T20:33:34.7912088Z D=7168, 2025-05-07T20:33:34.7912292Z scale_ub=None, 2025-05-07T20:33:34.7912512Z contiguous=False, 2025-05-07T20:33:34.7912742Z compiled=False, 2025-05-07T20:33:34.7913025Z ) 2025-05-07T20:33:34.9878527Z self = 2025-05-07T20:33:34.9879135Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:34.9879413Z 2025-05-07T20:33:34.9879502Z @given( 2025-05-07T20:33:34.9879736Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.9880059Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.9880371Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.9880709Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.9881037Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.9881328Z ) 2025-05-07T20:33:34.9881685Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.9882243Z def test_silu_mul_quant( 2025-05-07T20:33:34.9882495Z self, 2025-05-07T20:33:34.9882692Z T: int, 2025-05-07T20:33:34.9882891Z D: int, 2025-05-07T20:33:34.9883114Z scale_ub: Optional[float], 2025-05-07T20:33:34.9883389Z contiguous: bool, 2025-05-07T20:33:34.9883627Z compiled: bool, 2025-05-07T20:33:34.9883855Z ) -> None: 2025-05-07T20:33:34.9884076Z torch.manual_seed(2025) 2025-05-07T20:33:34.9884315Z 2025-05-07T20:33:34.9884599Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.9884950Z 2025-05-07T20:33:34.9885143Z x_sign = torch.sign(x) 2025-05-07T20:33:34.9885435Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.9885752Z x = x_sign * x_clamp 2025-05-07T20:33:34.9885996Z x0 = x[:, :D] 2025-05-07T20:33:34.9886231Z x1 = x[:, D:] 2025-05-07T20:33:34.9886473Z 2025-05-07T20:33:34.9886669Z if contiguous: 2025-05-07T20:33:34.9886905Z x0 = x0.contiguous() 2025-05-07T20:33:34.9887170Z x1 = x1.contiguous() 2025-05-07T20:33:34.9887413Z 2025-05-07T20:33:34.9887608Z if scale_ub is not None: 2025-05-07T20:33:34.9887883Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.9888226Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.9888534Z ) 2025-05-07T20:33:34.9888733Z else: 2025-05-07T20:33:34.9888957Z scale_ub_tensor = None 2025-05-07T20:33:34.9889209Z 2025-05-07T20:33:34.9889449Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.9889774Z op = silu_mul_quant 2025-05-07T20:33:34.9890028Z if compiled: 2025-05-07T20:33:34.9890278Z op = torch.compile(op) 2025-05-07T20:33:34.9890581Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.9890865Z 2025-05-07T20:33:34.9891063Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.9891236Z 2025-05-07T20:33:34.9891344Z moe/activation_test.py:117: 2025-05-07T20:33:34.9891646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.9891980Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.9892264Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.9892954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.9893640Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.9894180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.9894864Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.9895528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.9896134Z kernel = self.compile( 2025-05-07T20:33:34.9896779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.9897439Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.9897901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.9898132Z 2025-05-07T20:33:34.9898339Z self = 2025-05-07T20:33:34.9899421Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.9900796Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2cb0cb05e0>} 2025-05-07T20:33:34.9902185Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.9903218Z context = 2025-05-07T20:33:34.9903511Z 2025-05-07T20:33:34.9903678Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.9904203Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.9904675Z module_map=module_map) 2025-05-07T20:33:34.9905035Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.9905394Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.9905661Z E ^ 2025-05-07T20:33:34.9906124Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.9906579Z 2025-05-07T20:33:34.9906998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.9907514Z 2025-05-07T20:33:34.9907620Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.9908048Z self=, 2025-05-07T20:33:34.9908454Z T=4096, 2025-05-07T20:33:34.9908665Z D=5120, 2025-05-07T20:33:34.9908863Z scale_ub=1200.0, 2025-05-07T20:33:34.9909089Z contiguous=True, 2025-05-07T20:33:34.9909311Z compiled=False, 2025-05-07T20:33:34.9909525Z ) 2025-05-07T20:33:34.9909853Z self = 2025-05-07T20:33:34.9910349Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:34.9910633Z 2025-05-07T20:33:34.9910713Z @given( 2025-05-07T20:33:34.9910951Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.9911274Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.9911582Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.9911921Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.9912255Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.9912545Z ) 2025-05-07T20:33:34.9912899Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.9913346Z def test_silu_mul_quant( 2025-05-07T20:33:34.9913587Z self, 2025-05-07T20:33:34.9913785Z T: int, 2025-05-07T20:33:34.9913987Z D: int, 2025-05-07T20:33:34.9914211Z scale_ub: Optional[float], 2025-05-07T20:33:34.9914487Z contiguous: bool, 2025-05-07T20:33:34.9914728Z compiled: bool, 2025-05-07T20:33:34.9914959Z ) -> None: 2025-05-07T20:33:34.9915185Z torch.manual_seed(2025) 2025-05-07T20:33:34.9915439Z 2025-05-07T20:33:34.9915800Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.9916236Z 2025-05-07T20:33:34.9916439Z x_sign = torch.sign(x) 2025-05-07T20:33:34.9916736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.9917044Z x = x_sign * x_clamp 2025-05-07T20:33:34.9917336Z x0 = x[:, :D] 2025-05-07T20:33:34.9917555Z x1 = x[:, D:] 2025-05-07T20:33:34.9917761Z 2025-05-07T20:33:34.9917953Z if contiguous: 2025-05-07T20:33:34.9918193Z x0 = x0.contiguous() 2025-05-07T20:33:34.9918454Z x1 = x1.contiguous() 2025-05-07T20:33:34.9918701Z 2025-05-07T20:33:34.9918903Z if scale_ub is not None: 2025-05-07T20:33:34.9919173Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.9919509Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.9919821Z ) 2025-05-07T20:33:34.9920014Z else: 2025-05-07T20:33:34.9920229Z scale_ub_tensor = None 2025-05-07T20:33:34.9920483Z 2025-05-07T20:33:34.9920722Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.9921041Z op = silu_mul_quant 2025-05-07T20:33:34.9921335Z if compiled: 2025-05-07T20:33:34.9921587Z op = torch.compile(op) 2025-05-07T20:33:34.9921888Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.9922164Z 2025-05-07T20:33:34.9922361Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.9922527Z 2025-05-07T20:33:34.9922626Z moe/activation_test.py:117: 2025-05-07T20:33:34.9922924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.9923263Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.9923542Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.9924233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.9924921Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.9925463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.9926145Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.9926810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.9927344Z kernel = self.compile( 2025-05-07T20:33:34.9927878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.9928530Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.9928929Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.9929160Z 2025-05-07T20:33:34.9929370Z self = 2025-05-07T20:33:34.9930450Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.9931828Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2cb0cb1b20>} 2025-05-07T20:33:34.9933175Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.9934198Z context = 2025-05-07T20:33:34.9934483Z 2025-05-07T20:33:34.9934652Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.9935169Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.9935638Z module_map=module_map) 2025-05-07T20:33:34.9936053Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.9936442Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.9936701Z E ^ 2025-05-07T20:33:34.9937167Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = 
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    [test source identical to the previous example; this time fn() succeeds and the
     reference path fails instead]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2cb0cb2a20>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

The next three examples fail identically in ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row:

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
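Every failure above shares one root cause: Triton refuses to lower the fp8e4nv (float8 e4m3) dtype on this GPU. That matches the runner hardware: fp8e4nv codegen requires compute capability 8.9 or newer (Ada/Hopper), while the A10G on a g5 instance is SM 8.6, which is why only fp8e4b15 and fp8e5 are offered. A minimal guard along these lines (the helper name is illustrative, not an FBGEMM API) would let such tests skip rather than error on pre-8.9 machines:

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption from the error text: fp8e4nv (e4m3) needs SM 8.9+; the A10G
        # (SM 8.6) on this runner only exposes fp8e4b15/fp8e5.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # e.g. with unittest:
    # @unittest.skipUnless(supports_fp8e4nv(), "FP8 e4m3 requires SM 8.9+ hardware")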
Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)

W0507 20:33:36.625000 96975 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
W0507 20:33:36.625000 96975 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
W0507 20:33:36.625000 96975 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
W0507 20:33:36.625000 96975 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0507 20:33:36.625000 96975 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.

self = 
T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True

    [test source identical; fails again at the reference path]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
[same traceback as above through _kernel_quantize_fp8_row]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)

self = 
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    [test source identical; with the compiled op the failure surfaces at fn() instead]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[same Triton compile chain as above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)

self = 
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True

    [test source identical; fails at the reference path]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
[same traceback as above through _kernel_quantize_fp8_row]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
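Separately from the FP8 failures, the recompile_limit warning above shows torch.compile rebuilding silu_mul_quant once per shape/stride combination until Dynamo gives up after 8 recompiles; the stride 5120-vs-10240 mismatch is the contiguous copy versus the strided view of x. A sketch of two ways a sweep like this one could avoid hitting the limit (the config knob name is taken from the warning text itself; silu_mul here is a stand-in function, not the FBGEMM op):

    import torch
    import torch._dynamo

    # Option 1: raise the per-function recompile budget for the sweep.
    torch._dynamo.config.recompile_limit = 64

    # Option 2: compile with dynamic shapes so varying sizes/strides do not
    # each force a fresh graph.
    def silu_mul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return a * torch.sigmoid(a) * b

    compiled_silu_mul = torch.compile(silu_mul, dynamic=True)

    a = torch.randn(128, 5120)
    b = torch.randn(128, 5120)
    out = compiled_silu_mul(a, b)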
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.0849053Z 2025-05-07T20:33:37.0849472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.0849989Z 2025-05-07T20:33:37.0850098Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.0850517Z self=, 2025-05-07T20:33:37.0850927Z T=1, 2025-05-07T20:33:37.0851123Z D=5120, 2025-05-07T20:33:37.0851329Z scale_ub=None, 2025-05-07T20:33:37.0851543Z contiguous=True, 2025-05-07T20:33:37.0851771Z compiled=False, 2025-05-07T20:33:37.0851984Z ) 2025-05-07T20:33:37.2358053Z self = 2025-05-07T20:33:37.2358742Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.2359134Z 2025-05-07T20:33:37.2359250Z @given( 2025-05-07T20:33:37.2359568Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2360002Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2360447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2360842Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2361185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2361476Z ) 2025-05-07T20:33:37.2361891Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2362343Z def test_silu_mul_quant( 2025-05-07T20:33:37.2362595Z self, 2025-05-07T20:33:37.2362799Z T: int, 2025-05-07T20:33:37.2363004Z D: int, 2025-05-07T20:33:37.2363217Z scale_ub: Optional[float], 2025-05-07T20:33:37.2363495Z contiguous: bool, 2025-05-07T20:33:37.2363740Z compiled: bool, 2025-05-07T20:33:37.2363973Z ) -> None: 2025-05-07T20:33:37.2364203Z torch.manual_seed(2025) 2025-05-07T20:33:37.2364461Z 2025-05-07T20:33:37.2364738Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2365077Z 2025-05-07T20:33:37.2365275Z x_sign = torch.sign(x) 2025-05-07T20:33:37.2365827Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.2366216Z x = x_sign * x_clamp 2025-05-07T20:33:37.2366462Z x0 = x[:, :D] 2025-05-07T20:33:37.2366681Z x1 = x[:, D:] 2025-05-07T20:33:37.2366889Z 2025-05-07T20:33:37.2367076Z if contiguous: 2025-05-07T20:33:37.2367320Z x0 = x0.contiguous() 2025-05-07T20:33:37.2367627Z x1 = x1.contiguous() 2025-05-07T20:33:37.2367875Z 2025-05-07T20:33:37.2368073Z if scale_ub is not None: 2025-05-07T20:33:37.2368341Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.2368674Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.2369078Z ) 2025-05-07T20:33:37.2369354Z else: 2025-05-07T20:33:37.2369645Z scale_ub_tensor = None 2025-05-07T20:33:37.2369990Z 2025-05-07T20:33:37.2370258Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.2370572Z op = silu_mul_quant 2025-05-07T20:33:37.2370830Z if compiled: 2025-05-07T20:33:37.2371083Z op = torch.compile(op) 2025-05-07T20:33:37.2371374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2371656Z 2025-05-07T20:33:37.2371850Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.2372014Z 2025-05-07T20:33:37.2372112Z moe/activation_test.py:117: 2025-05-07T20:33:37.2372410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2372746Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.2373022Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2373707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.2374398Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.2374932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.2375615Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.2376278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.2376813Z kernel = self.compile( 2025-05-07T20:33:37.2377351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.2378007Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.2378408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2378639Z 2025-05-07T20:33:37.2378855Z self = 2025-05-07T20:33:37.2380022Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.2381448Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfbe5f6a0>} 2025-05-07T20:33:37.2382859Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.2383886Z context = 2025-05-07T20:33:37.2384173Z 2025-05-07T20:33:37.2384346Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.2384865Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.2385335Z module_map=module_map) 2025-05-07T20:33:37.2385700Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.2386064Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.2386331Z E ^ 2025-05-07T20:33:37.2386842Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.2387297Z 2025-05-07T20:33:37.2387722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.2388231Z 2025-05-07T20:33:37.2388339Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2388754Z self=, 2025-05-07T20:33:37.2389161Z T=128, 2025-05-07T20:33:37.2389359Z D=5120, 2025-05-07T20:33:37.2389552Z scale_ub=None, 2025-05-07T20:33:37.2389769Z contiguous=False, 2025-05-07T20:33:37.2389994Z compiled=True, 2025-05-07T20:33:37.2390198Z ) 2025-05-07T20:33:37.2390522Z self = 2025-05-07T20:33:37.2391045Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:37.2391315Z 2025-05-07T20:33:37.2391401Z @given( 2025-05-07T20:33:37.2391638Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2397848Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2398196Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2398540Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2398880Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2399168Z ) 2025-05-07T20:33:37.2399521Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2399964Z def test_silu_mul_quant( 2025-05-07T20:33:37.2400208Z self, 2025-05-07T20:33:37.2400413Z T: int, 2025-05-07T20:33:37.2400617Z D: int, 2025-05-07T20:33:37.2400849Z scale_ub: Optional[float], 2025-05-07T20:33:37.2401127Z contiguous: bool, 2025-05-07T20:33:37.2401376Z compiled: bool, 2025-05-07T20:33:37.2401615Z ) -> None: 2025-05-07T20:33:37.2401840Z torch.manual_seed(2025) 2025-05-07T20:33:37.2402090Z 2025-05-07T20:33:37.2402369Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2402725Z 2025-05-07T20:33:37.2402920Z x_sign = torch.sign(x) 2025-05-07T20:33:37.2403208Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.2403524Z x = x_sign * x_clamp 2025-05-07T20:33:37.2403772Z x0 = x[:, :D] 2025-05-07T20:33:37.2403991Z x1 = x[:, D:] 2025-05-07T20:33:37.2404199Z 2025-05-07T20:33:37.2404388Z if contiguous: 2025-05-07T20:33:37.2404621Z x0 = x0.contiguous() 2025-05-07T20:33:37.2404889Z x1 = x1.contiguous() 2025-05-07T20:33:37.2405143Z 2025-05-07T20:33:37.2405336Z if scale_ub is not None: 2025-05-07T20:33:37.2405624Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.2406042Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.2406392Z ) 2025-05-07T20:33:37.2406587Z else: 2025-05-07T20:33:37.2406802Z scale_ub_tensor = None 2025-05-07T20:33:37.2407055Z 2025-05-07T20:33:37.2407389Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.2407716Z op = silu_mul_quant 2025-05-07T20:33:37.2407975Z if compiled: 2025-05-07T20:33:37.2408221Z op = torch.compile(op) 2025-05-07T20:33:37.2408518Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2408794Z 2025-05-07T20:33:37.2408987Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.2409156Z 2025-05-07T20:33:37.2409258Z moe/activation_test.py:117: 2025-05-07T20:33:37.2409568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2409905Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.2410184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2410748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:37.2411358Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then tries further examples; each fails with the identical CompilationError in _fbgemm_silu_mul_quant (test body and traceback as above):

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
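Every example so far dies at Triton's fp8e4nv check, which is what this error looks like on GPUs that predate hardware e4m3 support (Triton only offers fp8e4b15 and fp8e5 there). A minimal sketch of a guard one could use to skip these tests on such GPUs follows; it is not part of the FBGEMM test suite, and the sm_89 (Ada) threshold is an assumption on my part, not something this log confirms:

# Hypothetical helper, not from moe/activation_test.py: skip FP8 tests on
# GPUs whose architecture Triton's fp8e4nv (e4m3) type does not support.
import unittest

import torch

def _supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    # Assumption: fp8e4nv needs compute capability 8.9 (Ada) or newer.
    return torch.cuda.get_device_capability() >= (8, 9)

# Usage on the failing test method:
# @unittest.skipUnless(_supports_fp8e4nv(), "GPU lacks fp8e4nv support")
# def test_silu_mul_quant(self, ...) -> None: ...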
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)

self =
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    (test body identical to the first example above)
        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfadd4c20>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
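In this one example the op under test returned, and the failure moved into the reference path: triton_quantize_fp8_row also materializes fp8e4nv, so its _kernel_quantize_fp8_row fails the same architecture check. For readers without the FBGEMM source at hand, a rough pure-PyTorch emulation of what that rowwise quantization computes; the scale_ub clamping, the e4m3 max of 448, and overflow handling are assumptions here, not FBGEMM's actual kernel:

# Sketch of rowwise FP8 quantization as exercised by ref_fn(); an emulation
# under stated assumptions, not fbgemm_gpu's triton_quantize_fp8_row itself.
import torch

FP8_E4M3_MAX = 448.0  # assumed max finite value of float8_e4m3fn

def quantize_fp8_row_emulated(y: torch.Tensor, scale_ub=None):
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # scale_ub: 1-elem tensor
    row_max = torch.clamp(row_max, min=1e-12)       # guard all-zero rows
    scale = FP8_E4M3_MAX / row_max                  # per-row quantization scale
    y_fp8 = (y.float() * scale[:, None]).to(torch.float8_e4m3fn)
    # Dequantize with: y_fp8.float() * y_scale[:, None]
    return y_fp8, 1.0 / scale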
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.7704433Z 2025-05-07T20:33:37.7704859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.7705367Z 2025-05-07T20:33:37.7705475Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.7705935Z self=, 2025-05-07T20:33:37.7706376Z T=1, 2025-05-07T20:33:37.7706558Z D=5120, 2025-05-07T20:33:37.7706754Z scale_ub=1200.0, 2025-05-07T20:33:37.7706979Z contiguous=False, 2025-05-07T20:33:37.7707244Z compiled=True, 2025-05-07T20:33:37.7707443Z ) 2025-05-07T20:33:37.9259530Z self = 2025-05-07T20:33:37.9260962Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:37.9261695Z 2025-05-07T20:33:37.9261918Z @given( 2025-05-07T20:33:37.9262429Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.9263051Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.9263662Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.9264322Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.9264968Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.9265981Z ) 2025-05-07T20:33:37.9266706Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.9267955Z def test_silu_mul_quant( 2025-05-07T20:33:37.9268242Z self, 2025-05-07T20:33:37.9268457Z T: int, 2025-05-07T20:33:37.9268654Z D: int, 2025-05-07T20:33:37.9268870Z scale_ub: Optional[float], 2025-05-07T20:33:37.9269143Z contiguous: bool, 2025-05-07T20:33:37.9269393Z compiled: bool, 2025-05-07T20:33:37.9269613Z ) -> None: 2025-05-07T20:33:37.9269836Z torch.manual_seed(2025) 2025-05-07T20:33:37.9270089Z 2025-05-07T20:33:37.9270357Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.9270707Z 2025-05-07T20:33:37.9270915Z x_sign = torch.sign(x) 2025-05-07T20:33:37.9271208Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.9271523Z x = x_sign * x_clamp 2025-05-07T20:33:37.9271771Z x0 = x[:, :D] 2025-05-07T20:33:37.9271986Z x1 = x[:, D:] 2025-05-07T20:33:37.9272205Z 2025-05-07T20:33:37.9272399Z if contiguous: 2025-05-07T20:33:37.9272638Z x0 = x0.contiguous() 2025-05-07T20:33:37.9272902Z x1 = x1.contiguous() 2025-05-07T20:33:37.9273145Z 2025-05-07T20:33:37.9273336Z if scale_ub is not None: 2025-05-07T20:33:37.9273611Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.9273948Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.9274258Z ) 2025-05-07T20:33:37.9274448Z else: 2025-05-07T20:33:37.9274662Z scale_ub_tensor = None 2025-05-07T20:33:37.9274913Z 2025-05-07T20:33:37.9275142Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.9275456Z op = silu_mul_quant 2025-05-07T20:33:37.9275790Z if compiled: 2025-05-07T20:33:37.9276037Z op = torch.compile(op) 2025-05-07T20:33:37.9276334Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.9276619Z 2025-05-07T20:33:37.9276810Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.9276983Z 2025-05-07T20:33:37.9277082Z moe/activation_test.py:117: 2025-05-07T20:33:37.9277387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.9277727Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.9278010Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.9278568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:37.9279128Z return fn(*args, **kwargs) 
2025-05-07T20:33:37.9279781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.9280468Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.9281004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.9281755Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.9282470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.9283063Z kernel = self.compile( 2025-05-07T20:33:37.9283602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.9284251Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.9284651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.9284887Z 2025-05-07T20:33:37.9285095Z self = 2025-05-07T20:33:37.9286181Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.9287607Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfadd5ee0>} 2025-05-07T20:33:37.9289006Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.9290030Z context = 2025-05-07T20:33:37.9290318Z 2025-05-07T20:33:37.9290488Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.9291013Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.9291487Z module_map=module_map) 2025-05-07T20:33:37.9291850Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.9292209Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.9292471Z E ^ 2025-05-07T20:33:37.9292937Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.9293392Z 2025-05-07T20:33:37.9293810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.9294317Z 2025-05-07T20:33:37.9294426Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.9294845Z self=, 2025-05-07T20:33:37.9295257Z T=1, 2025-05-07T20:33:37.9295443Z D=5120, 2025-05-07T20:33:37.9295637Z scale_ub=1200.0, 2025-05-07T20:33:37.9295871Z contiguous=False, 2025-05-07T20:33:37.9296098Z compiled=False, 2025-05-07T20:33:37.9296302Z ) 2025-05-07T20:33:37.9296622Z self = 2025-05-07T20:33:37.9297113Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:37.9297382Z 2025-05-07T20:33:37.9297466Z @given( 2025-05-07T20:33:37.9297699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.9298041Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.9298375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.9298706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.9299041Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.9299331Z ) 2025-05-07T20:33:37.9299674Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.9300118Z def test_silu_mul_quant( 2025-05-07T20:33:37.9300368Z self, 2025-05-07T20:33:37.9300562Z T: int, 2025-05-07T20:33:37.9300768Z D: int, 2025-05-07T20:33:37.9300994Z scale_ub: Optional[float], 2025-05-07T20:33:37.9301264Z contiguous: bool, 2025-05-07T20:33:37.9301508Z compiled: bool, 2025-05-07T20:33:37.9301819Z ) -> None: 2025-05-07T20:33:37.9302039Z torch.manual_seed(2025) 2025-05-07T20:33:37.9302277Z 2025-05-07T20:33:37.9302551Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.9302938Z 2025-05-07T20:33:37.9303129Z x_sign = torch.sign(x) 2025-05-07T20:33:37.9303421Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.9303730Z x = x_sign * x_clamp 2025-05-07T20:33:37.9303968Z x0 = x[:, :D] 2025-05-07T20:33:37.9304185Z x1 = x[:, D:] 2025-05-07T20:33:37.9304394Z 2025-05-07T20:33:37.9304575Z if contiguous: 2025-05-07T20:33:37.9304813Z x0 = x0.contiguous() 2025-05-07T20:33:37.9305073Z x1 = x1.contiguous() 2025-05-07T20:33:37.9305307Z 2025-05-07T20:33:37.9305503Z if scale_ub is not None: 2025-05-07T20:33:37.9305782Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.9306120Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.9306446Z ) 2025-05-07T20:33:37.9306644Z else: 2025-05-07T20:33:37.9306904Z scale_ub_tensor = None 2025-05-07T20:33:37.9307163Z 2025-05-07T20:33:37.9307400Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.9307721Z op = silu_mul_quant 2025-05-07T20:33:37.9307975Z if compiled: 2025-05-07T20:33:37.9308221Z op = torch.compile(op) 2025-05-07T20:33:37.9308516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.9308792Z 2025-05-07T20:33:37.9308983Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.9309151Z 2025-05-07T20:33:37.9309261Z moe/activation_test.py:117: 2025-05-07T20:33:37.9309555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.9309888Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.9310173Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.9310861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.9311551Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.9312085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.9312769Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.9313426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.9313957Z kernel = self.compile( 2025-05-07T20:33:37.9314501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.9315153Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.9315545Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.9315853Z 2025-05-07T20:33:37.9316062Z self = 2025-05-07T20:33:37.9317150Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.9318574Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfadd6b60>} 2025-05-07T20:33:37.9319914Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.9320941Z context = 2025-05-07T20:33:37.9321233Z 2025-05-07T20:33:37.9321400Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.9322048Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.9322519Z module_map=module_map) 2025-05-07T20:33:37.9322928Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.9323289Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.9323549Z E ^ 2025-05-07T20:33:37.9324012Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.9324472Z 2025-05-07T20:33:37.9324884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.9325393Z 2025-05-07T20:33:37.9325504Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.9325913Z self=, 2025-05-07T20:33:37.9326319Z T=16384, 2025-05-07T20:33:37.9326521Z D=5120, 2025-05-07T20:33:37.9332876Z scale_ub=1200.0, 2025-05-07T20:33:37.9333120Z contiguous=False, 2025-05-07T20:33:37.9333419Z compiled=True, 2025-05-07T20:33:37.9333636Z ) 2025-05-07T20:33:38.0201265Z self = 2025-05-07T20:33:38.0202050Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:38.0202453Z 2025-05-07T20:33:38.0202572Z @given( 2025-05-07T20:33:38.0202883Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.0203312Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.0203624Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.0203959Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.0204291Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.0204573Z ) 2025-05-07T20:33:38.0204929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.0205384Z def test_silu_mul_quant( 2025-05-07T20:33:38.0205633Z self, 2025-05-07T20:33:38.0205831Z T: int, 2025-05-07T20:33:38.0206034Z D: int, 2025-05-07T20:33:38.0206260Z scale_ub: Optional[float], 2025-05-07T20:33:38.0206536Z contiguous: bool, 2025-05-07T20:33:38.0206777Z compiled: bool, 2025-05-07T20:33:38.0207002Z ) -> None: 2025-05-07T20:33:38.0207220Z torch.manual_seed(2025) 2025-05-07T20:33:38.0207472Z 2025-05-07T20:33:38.0207746Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.0208091Z 2025-05-07T20:33:38.0208289Z x_sign = torch.sign(x) 2025-05-07T20:33:38.0208583Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.0208891Z x = x_sign * x_clamp 2025-05-07T20:33:38.0209135Z x0 = x[:, :D] 2025-05-07T20:33:38.0209358Z x1 = x[:, D:] 2025-05-07T20:33:38.0209570Z 2025-05-07T20:33:38.0209758Z if contiguous: 2025-05-07T20:33:38.0210003Z x0 = x0.contiguous() 2025-05-07T20:33:38.0210262Z x1 = x1.contiguous() 2025-05-07T20:33:38.0210510Z 2025-05-07T20:33:38.0210706Z if scale_ub is not None: 2025-05-07T20:33:38.0210984Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.0211318Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.0211634Z ) 2025-05-07T20:33:38.0211830Z else: 2025-05-07T20:33:38.0212039Z scale_ub_tensor = None 2025-05-07T20:33:38.0212296Z 2025-05-07T20:33:38.0212536Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.0212851Z op = silu_mul_quant 2025-05-07T20:33:38.0213102Z if compiled: 2025-05-07T20:33:38.0213355Z op = torch.compile(op) 2025-05-07T20:33:38.0213649Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.0213934Z 2025-05-07T20:33:38.0214133Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.0214305Z 2025-05-07T20:33:38.0214584Z moe/activation_test.py:117: 2025-05-07T20:33:38.0214897Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.0215234Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.0215577Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.0216137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:38.0216710Z return fn(*args, **kwargs) 
2025-05-07T20:33:38.0217379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.0218230Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.0218907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.0219769Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.0220515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.0221109Z kernel = self.compile( 2025-05-07T20:33:38.0221657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.0222317Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.0222723Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.0222956Z 2025-05-07T20:33:38.0223174Z self = 2025-05-07T20:33:38.0224264Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.0225648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfb4c8220>} 2025-05-07T20:33:38.0227000Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.0228203Z context = 2025-05-07T20:33:38.0228571Z 2025-05-07T20:33:38.0228780Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.0229438Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.0229956Z module_map=module_map) 2025-05-07T20:33:38.0230318Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.0230677Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.0230941Z E ^ 2025-05-07T20:33:38.0231412Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.0231874Z 2025-05-07T20:33:38.0232290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.0232807Z 2025-05-07T20:33:38.0232914Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.0233332Z self=, 2025-05-07T20:33:38.0233734Z T=2048, 2025-05-07T20:33:38.0233928Z D=7168, 2025-05-07T20:33:38.0234122Z scale_ub=1200.0, 2025-05-07T20:33:38.0234372Z contiguous=False, 2025-05-07T20:33:38.0234594Z compiled=True, 2025-05-07T20:33:38.0234803Z ) 2025-05-07T20:33:38.0235127Z self = 2025-05-07T20:33:38.0235619Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:38.0236039Z 2025-05-07T20:33:38.0236119Z @given( 2025-05-07T20:33:38.0236504Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.0236824Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.0237136Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.0237518Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.0237855Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.0238143Z ) 2025-05-07T20:33:38.0238495Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.0238939Z def test_silu_mul_quant( 2025-05-07T20:33:38.0239179Z self, 2025-05-07T20:33:38.0239380Z T: int, 2025-05-07T20:33:38.0239582Z D: int, 2025-05-07T20:33:38.0239798Z scale_ub: Optional[float], 2025-05-07T20:33:38.0240075Z contiguous: bool, 2025-05-07T20:33:38.0240318Z compiled: bool, 2025-05-07T20:33:38.0240548Z ) -> None: 2025-05-07T20:33:38.0240764Z torch.manual_seed(2025) 2025-05-07T20:33:38.0241008Z 2025-05-07T20:33:38.0241291Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.0241680Z 2025-05-07T20:33:38.0241881Z x_sign = torch.sign(x) 2025-05-07T20:33:38.0242176Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.0242488Z x = x_sign * x_clamp 2025-05-07T20:33:38.0242729Z x0 = x[:, :D] 2025-05-07T20:33:38.0242947Z x1 = x[:, D:] 2025-05-07T20:33:38.0243155Z 2025-05-07T20:33:38.0243345Z if contiguous: 2025-05-07T20:33:38.0243581Z x0 = x0.contiguous() 2025-05-07T20:33:38.0243841Z x1 = x1.contiguous() 2025-05-07T20:33:38.0244083Z 2025-05-07T20:33:38.0244281Z if scale_ub is not None: 2025-05-07T20:33:38.0244552Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.0244890Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.0245205Z ) 2025-05-07T20:33:38.0245399Z else: 2025-05-07T20:33:38.0245616Z scale_ub_tensor = None 2025-05-07T20:33:38.0245878Z 2025-05-07T20:33:38.0246116Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.0246430Z op = silu_mul_quant 2025-05-07T20:33:38.0246686Z if compiled: 2025-05-07T20:33:38.0246936Z op = torch.compile(op) 2025-05-07T20:33:38.0247230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.0247512Z 2025-05-07T20:33:38.0247716Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.0247906Z 2025-05-07T20:33:38.0248031Z moe/activation_test.py:117: 2025-05-07T20:33:38.0248330Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.0248665Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.0248944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.0249501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:38.0250064Z return fn(*args, **kwargs) 
2025-05-07T20:33:38.0248031Z moe/activation_test.py:117: 
2025-05-07T20:33:38.0248330Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:38.0248665Z moe/activation_test.py:115: in fn
2025-05-07T20:33:38.0248944Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:38.0249501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:38.0250064Z     return fn(*args, **kwargs)
2025-05-07T20:33:38.0250729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:38.0251418Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:38.0251962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:38.0252646Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:38.0253311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:38.0253841Z     kernel = self.compile(
2025-05-07T20:33:38.0254382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:38.0255041Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:38.0255486Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:38.0255761Z 
2025-05-07T20:33:38.0255973Z self = 
2025-05-07T20:33:38.0257058Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:38.0258530Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfb4c8f40>}
2025-05-07T20:33:38.0259876Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:38.0260896Z context = 
2025-05-07T20:33:38.0261190Z 
2025-05-07T20:33:38.0261362Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:38.0261929Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:38.0262417Z                            module_map=module_map)
2025-05-07T20:33:38.0262779Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:38.0263137Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:38.0263408Z E   ^
2025-05-07T20:33:38.0263878Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:38.0264339Z 
2025-05-07T20:33:38.0264751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:38.0265268Z 
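The same compile-time failure reproduces outside the test suite: on a device below compute capability 8.9, any Triton kernel that casts to tl.float8e4nv is rejected while the AST is lowered to TTIR, before anything is launched. A self-contained sketch, with an illustrative kernel that is not part of FBGEMM:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        y = x.to(tl.float8e4nv)  # rejected at make_ir time on pre-sm_89 GPUs
        tl.store(y_ptr + offs, y, mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    # On unsupported hardware the launch below raises CompilationError
    # wrapping ValueError("type fp8e4nv not supported in this architecture. ...")
    cast_to_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=1024)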
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.0264339Z 2025-05-07T20:33:38.0264751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.0265268Z 2025-05-07T20:33:38.1424228Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.1425432Z self=, 2025-05-07T20:33:38.1426624Z T=1, 2025-05-07T20:33:38.1427130Z D=5120, 2025-05-07T20:33:38.1427650Z scale_ub=None, 2025-05-07T20:33:38.1428121Z contiguous=False, 2025-05-07T20:33:38.1428442Z compiled=False, 2025-05-07T20:33:38.1428646Z ) 2025-05-07T20:33:38.1428962Z self = 2025-05-07T20:33:38.1429455Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:38.1429722Z 2025-05-07T20:33:38.1429801Z @given( 2025-05-07T20:33:38.1430040Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.1430349Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.1430654Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.1430987Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.1431307Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.1431596Z ) 2025-05-07T20:33:38.1431939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.1432378Z def test_silu_mul_quant( 2025-05-07T20:33:38.1432617Z self, 2025-05-07T20:33:38.1432811Z T: int, 2025-05-07T20:33:38.1433013Z D: int, 2025-05-07T20:33:38.1433234Z scale_ub: Optional[float], 2025-05-07T20:33:38.1433505Z contiguous: bool, 2025-05-07T20:33:38.1433751Z compiled: bool, 2025-05-07T20:33:38.1433972Z ) -> None: 2025-05-07T20:33:38.1434186Z torch.manual_seed(2025) 2025-05-07T20:33:38.1434429Z 2025-05-07T20:33:38.1434694Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.1435034Z 2025-05-07T20:33:38.1435228Z x_sign = torch.sign(x) 2025-05-07T20:33:38.1435511Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.1435919Z x = x_sign * x_clamp 2025-05-07T20:33:38.1436162Z x0 = x[:, :D] 2025-05-07T20:33:38.1436375Z x1 = x[:, D:] 2025-05-07T20:33:38.1436580Z 2025-05-07T20:33:38.1436980Z if contiguous: 2025-05-07T20:33:38.1437206Z x0 = x0.contiguous() 2025-05-07T20:33:38.1437466Z x1 = x1.contiguous() 2025-05-07T20:33:38.1437709Z 2025-05-07T20:33:38.1438002Z if scale_ub is not None: 2025-05-07T20:33:38.1438275Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.1438608Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.1438914Z ) 2025-05-07T20:33:38.1439107Z else: 2025-05-07T20:33:38.1439314Z scale_ub_tensor = None 2025-05-07T20:33:38.1439563Z 2025-05-07T20:33:38.1439788Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.1440099Z op = silu_mul_quant 2025-05-07T20:33:38.1440352Z if compiled: 2025-05-07T20:33:38.1440591Z op = torch.compile(op) 2025-05-07T20:33:38.1440880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.1441148Z 2025-05-07T20:33:38.1441342Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.1441509Z 2025-05-07T20:33:38.1441672Z moe/activation_test.py:117: 2025-05-07T20:33:38.1441969Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.1442301Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.1442576Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.1443256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.1443944Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.1444472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.1445151Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.1445837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.1446366Z kernel = self.compile( 2025-05-07T20:33:38.1446907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.1447552Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.1447993Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.1448229Z 2025-05-07T20:33:38.1448444Z self = 2025-05-07T20:33:38.1449524Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.1450888Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfb4c9ee0>} 2025-05-07T20:33:38.1452233Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.1453254Z context = 2025-05-07T20:33:38.1453541Z 2025-05-07T20:33:38.1453713Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.1454225Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.1454693Z module_map=module_map) 2025-05-07T20:33:38.1455058Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.1455410Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.1455664Z E ^ 2025-05-07T20:33:38.1456121Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.1456573Z 2025-05-07T20:33:38.1457057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.1457603Z 2025-05-07T20:33:38.1457709Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.1458161Z self=, 2025-05-07T20:33:38.1458560Z T=4096, 2025-05-07T20:33:38.1458747Z D=7168, 2025-05-07T20:33:38.1458934Z scale_ub=1200.0, 2025-05-07T20:33:38.1459155Z contiguous=False, 2025-05-07T20:33:38.1459375Z compiled=False, 2025-05-07T20:33:38.1459573Z ) 2025-05-07T20:33:38.1459886Z self = 2025-05-07T20:33:38.1460380Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:38.1460652Z 2025-05-07T20:33:38.1460730Z @given( 2025-05-07T20:33:38.1460961Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.1461270Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.1461586Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.1461952Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.1462283Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.1462575Z ) 2025-05-07T20:33:38.1462916Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.1463353Z def test_silu_mul_quant( 2025-05-07T20:33:38.1463590Z self, 2025-05-07T20:33:38.1463779Z T: int, 2025-05-07T20:33:38.1463971Z D: int, 2025-05-07T20:33:38.1464187Z scale_ub: Optional[float], 2025-05-07T20:33:38.1464455Z contiguous: bool, 2025-05-07T20:33:38.1464690Z compiled: bool, 2025-05-07T20:33:38.1464908Z ) -> None: 2025-05-07T20:33:38.1465120Z torch.manual_seed(2025) 2025-05-07T20:33:38.1465360Z 2025-05-07T20:33:38.1465930Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.1466266Z 2025-05-07T20:33:38.1466464Z x_sign = torch.sign(x) 2025-05-07T20:33:38.1466753Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.1467066Z x = x_sign * x_clamp 2025-05-07T20:33:38.1467300Z x0 = x[:, :D] 2025-05-07T20:33:38.1467516Z x1 = x[:, D:] 2025-05-07T20:33:38.1467721Z 2025-05-07T20:33:38.1467904Z if contiguous: 2025-05-07T20:33:38.1468167Z x0 = x0.contiguous() 2025-05-07T20:33:38.1468437Z x1 = x1.contiguous() 2025-05-07T20:33:38.1468673Z 2025-05-07T20:33:38.1468863Z if scale_ub is not None: 2025-05-07T20:33:38.1469135Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.1469466Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.1469771Z ) 2025-05-07T20:33:38.1469964Z else: 2025-05-07T20:33:38.1470166Z scale_ub_tensor = None 2025-05-07T20:33:38.1470415Z 2025-05-07T20:33:38.1470650Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.1470961Z op = silu_mul_quant 2025-05-07T20:33:38.1471214Z if compiled: 2025-05-07T20:33:38.1471458Z op = torch.compile(op) 2025-05-07T20:33:38.1471750Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.1472021Z 2025-05-07T20:33:38.1472215Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.1472376Z 2025-05-07T20:33:38.1472479Z moe/activation_test.py:117: 2025-05-07T20:33:38.1472766Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.1473101Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.1473381Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.1474056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:38.1474745Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.1475350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.1476124Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.1476778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.1477368Z kernel = self.compile( 2025-05-07T20:33:38.1477904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.1478552Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.1478943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.1479173Z 2025-05-07T20:33:38.1479377Z self = 2025-05-07T20:33:38.1480457Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.1481880Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfb4cb420>} 2025-05-07T20:33:38.1483217Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.1484243Z context = 2025-05-07T20:33:38.1484535Z 2025-05-07T20:33:38.1484702Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.1485225Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.1485691Z module_map=module_map) 2025-05-07T20:33:38.1486061Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.1486420Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.1486686Z E ^ 2025-05-07T20:33:38.1487154Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.1487608Z 2025-05-07T20:33:38.1488017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.1488524Z 2025-05-07T20:33:38.1488634Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.1489041Z self=, 2025-05-07T20:33:38.1489444Z T=16384, 2025-05-07T20:33:38.1489642Z D=7168, 2025-05-07T20:33:38.1489832Z scale_ub=None, 2025-05-07T20:33:38.1490045Z contiguous=True, 2025-05-07T20:33:38.1490269Z compiled=True, 2025-05-07T20:33:38.1490467Z ) 2025-05-07T20:33:38.3243782Z self = 2025-05-07T20:33:38.3245326Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:38.3246109Z 2025-05-07T20:33:38.3246326Z @given( 2025-05-07T20:33:38.3246933Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.3247689Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.3248140Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.3248506Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.3248829Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.3249114Z ) 2025-05-07T20:33:38.3249469Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.3249908Z def test_silu_mul_quant( 2025-05-07T20:33:38.3250152Z self, 2025-05-07T20:33:38.3250353Z T: int, 2025-05-07T20:33:38.3250547Z D: int, 2025-05-07T20:33:38.3250769Z scale_ub: Optional[float], 2025-05-07T20:33:38.3251040Z contiguous: bool, 2025-05-07T20:33:38.3251495Z compiled: bool, 2025-05-07T20:33:38.3251721Z ) -> None: 2025-05-07T20:33:38.3251941Z torch.manual_seed(2025) 2025-05-07T20:33:38.3252188Z 2025-05-07T20:33:38.3252524Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.3252876Z 2025-05-07T20:33:38.3253068Z x_sign = torch.sign(x) 2025-05-07T20:33:38.3253356Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.3253682Z x = x_sign * x_clamp 2025-05-07T20:33:38.3253924Z x0 = x[:, :D] 2025-05-07T20:33:38.3254136Z x1 = x[:, D:] 2025-05-07T20:33:38.3254347Z 2025-05-07T20:33:38.3254539Z if contiguous: 2025-05-07T20:33:38.3254775Z x0 = x0.contiguous() 2025-05-07T20:33:38.3255044Z x1 = x1.contiguous() 2025-05-07T20:33:38.3255290Z 2025-05-07T20:33:38.3255481Z if scale_ub is not None: 2025-05-07T20:33:38.3261984Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.3262347Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.3262755Z ) 2025-05-07T20:33:38.3262948Z else: 2025-05-07T20:33:38.3263160Z scale_ub_tensor = None 2025-05-07T20:33:38.3263421Z 2025-05-07T20:33:38.3263650Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.3263963Z op = silu_mul_quant 2025-05-07T20:33:38.3264221Z if compiled: 2025-05-07T20:33:38.3264466Z op = torch.compile(op) 2025-05-07T20:33:38.3264759Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.3265042Z 2025-05-07T20:33:38.3265238Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.3265664Z 2025-05-07T20:33:38.3265769Z moe/activation_test.py:117: 2025-05-07T20:33:38.3266080Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.3266416Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.3266708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.3267280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:38.3267837Z return fn(*args, **kwargs) 
2025-05-07T20:33:38.3268495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.3269182Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.3269724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.3270402Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.3271066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.3271607Z kernel = self.compile( 2025-05-07T20:33:38.3272153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.3272806Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.3273366Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.3273605Z 2025-05-07T20:33:38.3273820Z self = 2025-05-07T20:33:38.3274909Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.3276366Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfaa5c540>} 2025-05-07T20:33:38.3277801Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.3278896Z context = 2025-05-07T20:33:38.3279185Z 2025-05-07T20:33:38.3279357Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.3279937Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.3280416Z module_map=module_map) 2025-05-07T20:33:38.3280785Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.3281137Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.3281400Z E ^ 2025-05-07T20:33:38.3281868Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.3282323Z 2025-05-07T20:33:38.3282739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.3283258Z 2025-05-07T20:33:38.3283363Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.3283847Z self=, 2025-05-07T20:33:38.3284258Z T=4096, 2025-05-07T20:33:38.3284453Z D=5120, 2025-05-07T20:33:38.3284646Z scale_ub=None, 2025-05-07T20:33:38.3284864Z contiguous=False, 2025-05-07T20:33:38.3285093Z compiled=True, 2025-05-07T20:33:38.3285293Z ) 2025-05-07T20:33:38.3285615Z self = 2025-05-07T20:33:38.3286108Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:38.3286381Z 2025-05-07T20:33:38.3286460Z @given( 2025-05-07T20:33:38.3286695Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.3287016Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.3287327Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.3287663Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.3288003Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.3288295Z ) 2025-05-07T20:33:38.3288639Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.3289088Z def test_silu_mul_quant( 2025-05-07T20:33:38.3289340Z self, 2025-05-07T20:33:38.3289542Z T: int, 2025-05-07T20:33:38.3289749Z D: int, 2025-05-07T20:33:38.3289973Z scale_ub: Optional[float], 2025-05-07T20:33:38.3290245Z contiguous: bool, 2025-05-07T20:33:38.3290490Z compiled: bool, 2025-05-07T20:33:38.3290715Z ) -> None: 2025-05-07T20:33:38.3290930Z torch.manual_seed(2025) 2025-05-07T20:33:38.3291181Z 2025-05-07T20:33:38.3291459Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.3291800Z 2025-05-07T20:33:38.3291999Z x_sign = torch.sign(x) 2025-05-07T20:33:38.3292293Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.3292607Z x = x_sign * x_clamp 2025-05-07T20:33:38.3292852Z x0 = x[:, :D] 2025-05-07T20:33:38.3293073Z x1 = x[:, D:] 2025-05-07T20:33:38.3293280Z 2025-05-07T20:33:38.3293474Z if contiguous: 2025-05-07T20:33:38.3293710Z x0 = x0.contiguous() 2025-05-07T20:33:38.3293973Z x1 = x1.contiguous() 2025-05-07T20:33:38.3294212Z 2025-05-07T20:33:38.3294408Z if scale_ub is not None: 2025-05-07T20:33:38.3294687Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.3295019Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.3295331Z ) 2025-05-07T20:33:38.3295528Z else: 2025-05-07T20:33:38.3295738Z scale_ub_tensor = None 2025-05-07T20:33:38.3295997Z 2025-05-07T20:33:38.3296229Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.3296545Z op = silu_mul_quant 2025-05-07T20:33:38.3296800Z if compiled: 2025-05-07T20:33:38.3297147Z op = torch.compile(op) 2025-05-07T20:33:38.3297444Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.3297721Z 2025-05-07T20:33:38.3297916Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.3298126Z 2025-05-07T20:33:38.3298232Z moe/activation_test.py:117: 2025-05-07T20:33:38.3298525Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.3298860Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.3299145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.3299698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:38.3300261Z return fn(*args, **kwargs) 
2025-05-07T20:33:38.3300918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.3301606Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.3302186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.3302870Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.3303538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.3304069Z kernel = self.compile( 2025-05-07T20:33:38.3304610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.3305265Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.3305666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.3305896Z 2025-05-07T20:33:38.3306110Z self = 2025-05-07T20:33:38.3307201Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.3308596Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfaa5d260>} 2025-05-07T20:33:38.3309947Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.3310973Z context = 2025-05-07T20:33:38.3311268Z 2025-05-07T20:33:38.3311436Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.3311960Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.3312433Z module_map=module_map) 2025-05-07T20:33:38.3312800Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.3313161Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.3313426Z E ^ 2025-05-07T20:33:38.3313891Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.3314346Z 2025-05-07T20:33:38.3314758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.3315272Z 2025-05-07T20:33:38.6386067Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.6386689Z self=, 2025-05-07T20:33:38.6387298Z T=4096, 2025-05-07T20:33:38.6387550Z D=5120, 2025-05-07T20:33:38.6387813Z scale_ub=1200.0, 2025-05-07T20:33:38.6388149Z contiguous=False, 2025-05-07T20:33:38.6388732Z compiled=False, 2025-05-07T20:33:38.6389141Z ) 2025-05-07T20:33:38.6390054Z self = 2025-05-07T20:33:38.6390985Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:38.6391493Z 2025-05-07T20:33:38.6391752Z @given( 2025-05-07T20:33:38.6392168Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.6392813Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.6393434Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.6394028Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.6394626Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.6395147Z ) 2025-05-07T20:33:38.6395899Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.6396706Z def test_silu_mul_quant( 2025-05-07T20:33:38.6397149Z self, 2025-05-07T20:33:38.6397505Z T: int, 2025-05-07T20:33:38.6397853Z D: int, 2025-05-07T20:33:38.6398256Z scale_ub: Optional[float], 2025-05-07T20:33:38.6398580Z contiguous: bool, 2025-05-07T20:33:38.6398883Z compiled: bool, 2025-05-07T20:33:38.6399111Z ) -> None: 2025-05-07T20:33:38.6399333Z torch.manual_seed(2025) 2025-05-07T20:33:38.6399574Z 2025-05-07T20:33:38.6399844Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.6400192Z 2025-05-07T20:33:38.6400384Z x_sign = torch.sign(x) 2025-05-07T20:33:38.6400676Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.6400988Z x = x_sign * x_clamp 2025-05-07T20:33:38.6401220Z x0 = x[:, :D] 2025-05-07T20:33:38.6401438Z x1 = x[:, D:] 2025-05-07T20:33:38.6401650Z 2025-05-07T20:33:38.6401832Z if contiguous: 2025-05-07T20:33:38.6402069Z x0 = x0.contiguous() 2025-05-07T20:33:38.6402327Z x1 = x1.contiguous() 2025-05-07T20:33:38.6402562Z 2025-05-07T20:33:38.6402753Z if scale_ub is not None: 2025-05-07T20:33:38.6403033Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.6403374Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.6403681Z ) 2025-05-07T20:33:38.6403881Z else: 2025-05-07T20:33:38.6404097Z scale_ub_tensor = None 2025-05-07T20:33:38.6404343Z 2025-05-07T20:33:38.6404570Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.6404881Z op = silu_mul_quant 2025-05-07T20:33:38.6405129Z if compiled: 2025-05-07T20:33:38.6405389Z op = torch.compile(op) 2025-05-07T20:33:38.6405694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.6405962Z 2025-05-07T20:33:38.6406155Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.6406322Z 2025-05-07T20:33:38.6406430Z moe/activation_test.py:117: 2025-05-07T20:33:38.6406730Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.6407071Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.6407361Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.6408060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:38.6408789Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.6409322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.6410002Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.6410664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.6411186Z kernel = self.compile( 2025-05-07T20:33:38.6411724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.6412376Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.6412891Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.6413128Z 2025-05-07T20:33:38.6413334Z self = 2025-05-07T20:33:38.6414452Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.6415824Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfaa5e200>} 2025-05-07T20:33:38.6417166Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.6418200Z context = 2025-05-07T20:33:38.6418543Z 2025-05-07T20:33:38.6418768Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.6419290Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.6419761Z module_map=module_map) 2025-05-07T20:33:38.6420115Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.6420477Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.6420739Z E ^ 2025-05-07T20:33:38.6421206Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.6421662Z 2025-05-07T20:33:38.6422072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.6422583Z 2025-05-07T20:33:38.6422687Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.6423102Z self=, 2025-05-07T20:33:38.6423500Z T=4096, 2025-05-07T20:33:38.6423693Z D=5120, 2025-05-07T20:33:38.6423889Z scale_ub=1200.0, 2025-05-07T20:33:38.6424111Z contiguous=False, 2025-05-07T20:33:38.6424337Z compiled=True, 2025-05-07T20:33:38.6424543Z ) 2025-05-07T20:33:38.6424854Z self = 2025-05-07T20:33:38.6425346Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:38.6425622Z 2025-05-07T20:33:38.6425699Z @given( 2025-05-07T20:33:38.6425928Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.6426235Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.6426540Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.6426868Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.6427189Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.6427475Z ) 2025-05-07T20:33:38.6427823Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.6428259Z def test_silu_mul_quant( 2025-05-07T20:33:38.6428499Z self, 2025-05-07T20:33:38.6428696Z T: int, 2025-05-07T20:33:38.6428890Z D: int, 2025-05-07T20:33:38.6429108Z scale_ub: Optional[float], 2025-05-07T20:33:38.6429378Z contiguous: bool, 2025-05-07T20:33:38.6429616Z compiled: bool, 2025-05-07T20:33:38.6429833Z ) -> None: 2025-05-07T20:33:38.6430049Z torch.manual_seed(2025) 2025-05-07T20:33:38.6430296Z 2025-05-07T20:33:38.6430565Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.6430907Z 2025-05-07T20:33:38.6431100Z x_sign = torch.sign(x) 2025-05-07T20:33:38.6431384Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.6431694Z x = x_sign * x_clamp 2025-05-07T20:33:38.6431936Z x0 = x[:, :D] 2025-05-07T20:33:38.6432247Z x1 = x[:, D:] 2025-05-07T20:33:38.6432458Z 2025-05-07T20:33:38.6432654Z if contiguous: 2025-05-07T20:33:38.6432878Z x0 = x0.contiguous() 2025-05-07T20:33:38.6433136Z x1 = x1.contiguous() 2025-05-07T20:33:38.6433419Z 2025-05-07T20:33:38.6433606Z if scale_ub is not None: 2025-05-07T20:33:38.6433877Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.6434210Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.6434512Z ) 2025-05-07T20:33:38.6434718Z else: 2025-05-07T20:33:38.6434928Z scale_ub_tensor = None 2025-05-07T20:33:38.6435177Z 2025-05-07T20:33:38.6435403Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.6435758Z op = silu_mul_quant 2025-05-07T20:33:38.6436008Z if compiled: 2025-05-07T20:33:38.6436248Z op = torch.compile(op) 2025-05-07T20:33:38.6436541Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.6436819Z 2025-05-07T20:33:38.6437060Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.6437232Z 2025-05-07T20:33:38.6437329Z moe/activation_test.py:117: 2025-05-07T20:33:38.6437623Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.6437960Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.6438275Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.6438837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:38.6439391Z return fn(*args, **kwargs) 
2025-05-07T20:33:38.6440038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.6440721Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.6441257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.6441937Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.6442593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.6443123Z kernel = self.compile( 2025-05-07T20:33:38.6443656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.6444302Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.6444691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.6444923Z 2025-05-07T20:33:38.6445129Z self = 2025-05-07T20:33:38.6446203Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.6447574Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfaa5f2e0>} 2025-05-07T20:33:38.6448962Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.6449982Z context = 2025-05-07T20:33:38.6450273Z 2025-05-07T20:33:38.6450438Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.6450954Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.6451418Z module_map=module_map) 2025-05-07T20:33:38.6451782Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.6452229Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.6452484Z E ^ 2025-05-07T20:33:38.6452947Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.6453442Z 2025-05-07T20:33:38.6453857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.6454364Z 2025-05-07T20:33:38.7601698Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.7602324Z self=, 2025-05-07T20:33:38.7602890Z T=2048, 2025-05-07T20:33:38.7603165Z D=7168, 2025-05-07T20:33:38.7603436Z scale_ub=1200.0, 2025-05-07T20:33:38.7603735Z contiguous=False, 2025-05-07T20:33:38.7604048Z compiled=False, 2025-05-07T20:33:38.7604320Z ) 2025-05-07T20:33:38.7604640Z self = 2025-05-07T20:33:38.7605146Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:38.7605426Z 2025-05-07T20:33:38.7605618Z @given( 2025-05-07T20:33:38.7605847Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.7606167Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.7606478Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.7606817Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.7607145Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.7607438Z ) 2025-05-07T20:33:38.7607784Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.7608221Z def test_silu_mul_quant( 2025-05-07T20:33:38.7608462Z self, 2025-05-07T20:33:38.7608660Z T: int, 2025-05-07T20:33:38.7608852Z D: int, 2025-05-07T20:33:38.7609070Z scale_ub: Optional[float], 2025-05-07T20:33:38.7609338Z contiguous: bool, 2025-05-07T20:33:38.7609578Z compiled: bool, 2025-05-07T20:33:38.7609813Z ) -> None: 2025-05-07T20:33:38.7610064Z torch.manual_seed(2025) 2025-05-07T20:33:38.7610306Z 2025-05-07T20:33:38.7610577Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.7610918Z 2025-05-07T20:33:38.7611111Z x_sign = torch.sign(x) 2025-05-07T20:33:38.7611399Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.7611708Z x = x_sign * x_clamp 2025-05-07T20:33:38.7611946Z x0 = x[:, :D] 2025-05-07T20:33:38.7612164Z x1 = x[:, D:] 2025-05-07T20:33:38.7612366Z 2025-05-07T20:33:38.7612554Z if contiguous: 2025-05-07T20:33:38.7612794Z x0 = x0.contiguous() 2025-05-07T20:33:38.7613059Z x1 = x1.contiguous() 2025-05-07T20:33:38.7613303Z 2025-05-07T20:33:38.7613501Z if scale_ub is not None: 2025-05-07T20:33:38.7613772Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.7614115Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.7614426Z ) 2025-05-07T20:33:38.7614623Z else: 2025-05-07T20:33:38.7614832Z scale_ub_tensor = None 2025-05-07T20:33:38.7615084Z 2025-05-07T20:33:38.7615319Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.7615633Z op = silu_mul_quant 2025-05-07T20:33:38.7615885Z if compiled: 2025-05-07T20:33:38.7616132Z op = torch.compile(op) 2025-05-07T20:33:38.7616419Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.7616698Z 2025-05-07T20:33:38.7616892Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.7617057Z 2025-05-07T20:33:38.7617168Z moe/activation_test.py:117: 2025-05-07T20:33:38.7617458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.7617790Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.7618076Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.7618839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:38.7619584Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.7620176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.7620857Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.7621522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.7622054Z kernel = self.compile( 2025-05-07T20:33:38.7622590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.7623237Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.7623635Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.7623874Z 2025-05-07T20:33:38.7624084Z self = 2025-05-07T20:33:38.7625208Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.7626578Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa9a42c0>} 2025-05-07T20:33:38.7627923Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.7628952Z context = 2025-05-07T20:33:38.7629240Z 2025-05-07T20:33:38.7629411Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.7629937Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.7630404Z module_map=module_map) 2025-05-07T20:33:38.7630771Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.7631128Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.7631384Z E ^ 2025-05-07T20:33:38.7631852Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.7632300Z 2025-05-07T20:33:38.7638279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.7638835Z 2025-05-07T20:33:38.7638943Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.7639353Z self=, 2025-05-07T20:33:38.7639747Z T=1, 2025-05-07T20:33:38.7639949Z D=7168, 2025-05-07T20:33:38.7640137Z scale_ub=None, 2025-05-07T20:33:38.7640348Z contiguous=True, 2025-05-07T20:33:38.7640575Z compiled=False, 2025-05-07T20:33:38.7640784Z ) 2025-05-07T20:33:38.7641095Z self = 2025-05-07T20:33:38.7641579Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:38.7641847Z 2025-05-07T20:33:38.7641928Z @given( 2025-05-07T20:33:38.7642163Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.7642473Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.7642780Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.7643106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.7643428Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.7643718Z ) 2025-05-07T20:33:38.7644056Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.7644611Z def test_silu_mul_quant( 2025-05-07T20:33:38.7644852Z self, 2025-05-07T20:33:38.7645043Z T: int, 2025-05-07T20:33:38.7645239Z D: int, 2025-05-07T20:33:38.7645453Z scale_ub: Optional[float], 2025-05-07T20:33:38.7645790Z contiguous: bool, 2025-05-07T20:33:38.7646040Z compiled: bool, 2025-05-07T20:33:38.7646280Z ) -> None: 2025-05-07T20:33:38.7646503Z torch.manual_seed(2025) 2025-05-07T20:33:38.7646755Z 2025-05-07T20:33:38.7647050Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.7647429Z 2025-05-07T20:33:38.7647626Z x_sign = torch.sign(x) 2025-05-07T20:33:38.7647955Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.7648296Z x = x_sign * x_clamp 2025-05-07T20:33:38.7648544Z x0 = x[:, :D] 2025-05-07T20:33:38.7648770Z x1 = x[:, D:] 2025-05-07T20:33:38.7648985Z 2025-05-07T20:33:38.7649176Z if contiguous: 2025-05-07T20:33:38.7649431Z x0 = x0.contiguous() 2025-05-07T20:33:38.7649749Z x1 = x1.contiguous() 2025-05-07T20:33:38.7650000Z 2025-05-07T20:33:38.7650200Z if scale_ub is not None: 2025-05-07T20:33:38.7650495Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.7650858Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.7651200Z ) 2025-05-07T20:33:38.7651401Z else: 2025-05-07T20:33:38.7651616Z scale_ub_tensor = None 2025-05-07T20:33:38.7651880Z 2025-05-07T20:33:38.7652123Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.7652438Z op = silu_mul_quant 2025-05-07T20:33:38.7652680Z if compiled: 2025-05-07T20:33:38.7652919Z op = torch.compile(op) 2025-05-07T20:33:38.7653210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.7653475Z 2025-05-07T20:33:38.7653669Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.7653837Z 2025-05-07T20:33:38.7653943Z moe/activation_test.py:117: 2025-05-07T20:33:38.7654233Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.7654567Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.7654844Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.7655533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.7656215Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.7656752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.7657429Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.7658086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.7658617Z kernel = self.compile( 2025-05-07T20:33:38.7659160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.7659817Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.7660206Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.7660442Z 2025-05-07T20:33:38.7660651Z self = 2025-05-07T20:33:38.7661735Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.7663108Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa9a51c0>} 2025-05-07T20:33:38.7664496Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.7665823Z context = 2025-05-07T20:33:38.7666198Z 2025-05-07T20:33:38.7666361Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.7666884Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.7667345Z module_map=module_map) 2025-05-07T20:33:38.7667706Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.7668062Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.7668316Z E ^ 2025-05-07T20:33:38.7668772Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.7669227Z 2025-05-07T20:33:38.7669640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.7670147Z 2025-05-07T20:33:38.7670319Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.7670735Z self=, 2025-05-07T20:33:38.7671139Z T=16384, 2025-05-07T20:33:38.7671332Z D=7168, 2025-05-07T20:33:38.7671529Z scale_ub=1200.0, 2025-05-07T20:33:38.7671744Z contiguous=False, 2025-05-07T20:33:38.7671967Z compiled=True, 2025-05-07T20:33:39.0089743Z ) 2025-05-07T20:33:39.0090430Z self = 2025-05-07T20:33:39.0091199Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:39.0091599Z 2025-05-07T20:33:39.0091712Z @given( 2025-05-07T20:33:39.0092033Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:39.0092385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:39.0092708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:39.0093046Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:39.0093373Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:39.0093665Z ) 2025-05-07T20:33:39.0094019Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:39.0094468Z def test_silu_mul_quant( 2025-05-07T20:33:39.0094706Z self, 2025-05-07T20:33:39.0094905Z T: int, 2025-05-07T20:33:39.0095107Z D: int, 2025-05-07T20:33:39.0095327Z scale_ub: Optional[float], 2025-05-07T20:33:39.0095609Z contiguous: bool, 2025-05-07T20:33:39.0095856Z compiled: bool, 2025-05-07T20:33:39.0096077Z ) -> None: 2025-05-07T20:33:39.0096293Z torch.manual_seed(2025) 2025-05-07T20:33:39.0096544Z 2025-05-07T20:33:39.0096817Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:39.0097169Z 2025-05-07T20:33:39.0097365Z x_sign = torch.sign(x) 2025-05-07T20:33:39.0097662Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:39.0097983Z x = x_sign * x_clamp 2025-05-07T20:33:39.0098229Z x0 = x[:, :D] 2025-05-07T20:33:39.0098442Z x1 = x[:, D:] 2025-05-07T20:33:39.0098653Z 2025-05-07T20:33:39.0098844Z if contiguous: 2025-05-07T20:33:39.0099076Z x0 = x0.contiguous() 2025-05-07T20:33:39.0099346Z x1 = x1.contiguous() 2025-05-07T20:33:39.0099587Z 2025-05-07T20:33:39.0099786Z if scale_ub is not None: 2025-05-07T20:33:39.0100052Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:39.0100388Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:39.0100703Z ) 2025-05-07T20:33:39.0100894Z else: 2025-05-07T20:33:39.0101111Z scale_ub_tensor = None 2025-05-07T20:33:39.0101367Z 2025-05-07T20:33:39.0101596Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:39.0102102Z op = silu_mul_quant 2025-05-07T20:33:39.0102356Z if compiled: 2025-05-07T20:33:39.0102605Z op = torch.compile(op) 2025-05-07T20:33:39.0102904Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:39.0103280Z 2025-05-07T20:33:39.0103472Z > y_fp8, y_scale = fn() 2025-05-07T20:33:39.0103641Z 2025-05-07T20:33:39.0103744Z moe/activation_test.py:117: 2025-05-07T20:33:39.0104041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:39.0104376Z moe/activation_test.py:115: in fn 2025-05-07T20:33:39.0104654Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:39.0105210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:39.0105769Z return fn(*args, **kwargs) 
2025-05-07T20:33:39.0106419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:39.0107108Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:39.0107705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:39.0108412Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:39.0109099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:39.0109633Z kernel = self.compile( 2025-05-07T20:33:39.0110173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:39.0110824Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:39.0111232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:39.0111464Z 2025-05-07T20:33:39.0111672Z self = 2025-05-07T20:33:39.0112764Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:39.0114143Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa9a65c0>} 2025-05-07T20:33:39.0115483Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:39.0116613Z context = 2025-05-07T20:33:39.0116908Z 2025-05-07T20:33:39.0117073Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:39.0117602Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:39.0118075Z module_map=module_map) 2025-05-07T20:33:39.0118456Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:39.0118850Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:39.0119110Z E ^ 2025-05-07T20:33:39.0119576Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:39.0120035Z 2025-05-07T20:33:39.0120453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:39.0120962Z 2025-05-07T20:33:39.0121070Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:39.0121481Z self=, 2025-05-07T20:33:39.0121885Z T=1, 2025-05-07T20:33:39.0122076Z D=7168, 2025-05-07T20:33:39.0122269Z scale_ub=None, 2025-05-07T20:33:39.0122489Z contiguous=False, 2025-05-07T20:33:39.0122839Z compiled=False, 2025-05-07T20:33:39.0123044Z ) 2025-05-07T20:33:39.0123364Z self = 2025-05-07T20:33:39.0123860Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:39.0124167Z 2025-05-07T20:33:39.0124250Z @given( 2025-05-07T20:33:39.0124480Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:39.0124794Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:39.0125106Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:39.0125433Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:39.0125762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:39.0126048Z ) 2025-05-07T20:33:39.0126390Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:39.0126833Z def test_silu_mul_quant( 2025-05-07T20:33:39.0127079Z self, 2025-05-07T20:33:39.0127273Z T: int, 2025-05-07T20:33:39.0127480Z D: int, 2025-05-07T20:33:39.0127743Z scale_ub: Optional[float], 2025-05-07T20:33:39.0128013Z contiguous: bool, 2025-05-07T20:33:39.0128261Z compiled: bool, 2025-05-07T20:33:39.0128489Z ) -> None: 2025-05-07T20:33:39.0128702Z torch.manual_seed(2025) 2025-05-07T20:33:39.0128947Z 2025-05-07T20:33:39.0129225Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:39.0129572Z 2025-05-07T20:33:39.0129767Z x_sign = torch.sign(x) 2025-05-07T20:33:39.0130063Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:39.0130370Z x = x_sign * x_clamp 2025-05-07T20:33:39.0130614Z x0 = x[:, :D] 2025-05-07T20:33:39.0130834Z x1 = x[:, D:] 2025-05-07T20:33:39.0131047Z 2025-05-07T20:33:39.0131233Z if contiguous: 2025-05-07T20:33:39.0131465Z x0 = x0.contiguous() 2025-05-07T20:33:39.0131725Z x1 = x1.contiguous() 2025-05-07T20:33:39.0131969Z 2025-05-07T20:33:39.0132167Z if scale_ub is not None: 2025-05-07T20:33:39.0132450Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:39.0132788Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:39.0133105Z ) 2025-05-07T20:33:39.0133297Z else: 2025-05-07T20:33:39.0133506Z scale_ub_tensor = None 2025-05-07T20:33:39.0133757Z 2025-05-07T20:33:39.0133994Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:39.0134310Z op = silu_mul_quant 2025-05-07T20:33:39.0134556Z if compiled: 2025-05-07T20:33:39.0134810Z op = torch.compile(op) 2025-05-07T20:33:39.0135104Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:39.0135376Z 2025-05-07T20:33:39.0135578Z > y_fp8, y_scale = fn() 2025-05-07T20:33:39.0135742Z 2025-05-07T20:33:39.0135848Z moe/activation_test.py:117: 2025-05-07T20:33:39.0136144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:39.0136481Z moe/activation_test.py:115: in fn 2025-05-07T20:33:39.0136764Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:39.0137445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:39.0138136Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:39.0138727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:39.0139409Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:39.0140083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:39.0140614Z kernel = self.compile( 2025-05-07T20:33:39.0141153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:39.0141895Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:39.0142297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:39.0142535Z 2025-05-07T20:33:39.0142790Z self = 2025-05-07T20:33:39.0143867Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:39.0145240Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa9a71a0>} 2025-05-07T20:33:39.0146581Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:39.0147652Z context = 2025-05-07T20:33:39.0147950Z 2025-05-07T20:33:39.0148118Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:39.0148649Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:39.0149117Z module_map=module_map) 2025-05-07T20:33:39.0149482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:39.0149844Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:39.0150107Z E ^ 2025-05-07T20:33:39.0150569Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:39.0151025Z 2025-05-07T20:33:39.0151440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:39.0151947Z
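Root cause of the failures above: the _fbgemm_silu_mul_quant Triton kernel casts its output to fp8e4nv (torch.float8_e4m3fn), and, as the error message itself states, this GPU only exposes fp8e4b15 and fp8e5. Triton lowers fp8e4nv only on compute capability 8.9 or newer (Ada/Hopper), while the A10G behind this linux.g5.4xlarge.nvidia.gpu runner is sm_86. A minimal guard sketch follows; supports_fp8e4nv is a hypothetical helper and this is not FBGEMM's actual test code, but it illustrates how such tests are commonly skipped on unsupported hardware:

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) is only lowered by Triton on compute
    # capability >= 8.9 (e.g. L4, L40S, H100); the A10G here is sm_86.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical decorator; FBGEMM may gate this differently.
requires_fp8e4nv = unittest.skipIf(
    not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9"
)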
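For reference, the op under test fuses a SiLU-gated multiply with row-wise fp8 quantization. The sketch below is an eager-mode approximation inferred from the test body; silu_mul_quant_ref, the 448.0 fp8e4m3 max, and the reading of scale_ub as a cap on the per-row amax are all assumptions, not FBGEMM's implementation:

from typing import Optional, Tuple

import torch

FP8_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, then quantize each row to fp8e4m3 with a
    # per-row scale; scale_ub (if given) caps the amax used for scaling.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub.to(amax.dtype))
    y_scale = amax / FP8_MAX
    y_fp8 = (y / y_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, y_scale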
Hypothesis went on to retry the remaining parameter combinations; every example below failed identically, with the same test source listing and the same traceback through _fbgemm_silu_mul_quant[grid] at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80:
2025-05-07T20:33:39.0152059Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:39.1060858Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:39.2692812Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:39.2724726Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:39.4479772Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:39.4512663Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:39.5457328Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:39.7239141Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:39.7272133Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:39.9989384Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:40.1257020Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:40.1292846Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:40.1293208Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:40.1293513Z E   ^
2025-05-07T20:33:40.1293975Z E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:40.1294860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:40.1295481Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:40.1325826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError -- ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") (test source and traceback identical to the failure above)
2025-05-07T20:33:40.3061715Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:40.3097605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError -- same failure (traceback additionally passes through torch/_dynamo/eval_frame.py:678 because compiled=True)
2025-05-07T20:33:40.3098239Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:40.4039022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError -- same failure
2025-05-07T20:33:40.4039639Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:40.4078958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError -- same failure
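Note: the CompilationError entries in this run are all the same architecture mismatch, not a bug that depends on T, D, scale_ub, contiguous, or compiled. fp8e4nv is Triton's name for the float8_e4m3fn format, which Triton lowers only on NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper); on the pre-8.9 Ampere-class GPU in this job, Triton offers only fp8e4b15 and fp8e5, so _fbgemm_silu_mul_quant fails inside make_ir before the kernel ever runs. Below is a minimal sketch of a capability guard that would skip such examples on unsupported GPUs; the helper and class names are illustrative, not the ones used in moe/activation_test.py:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv corresponds to torch.float8_e4m3fn; Triton compiles it
        # only for compute capability >= 8.9 (e.g. L4, L40S, H100).  A GPU
        # whose supported fp8 set is ('fp8e4b15', 'fp8e5'), as in this log,
        # is below that threshold.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv requires an SM 8.9+ GPU")
    class SiluMulQuantFp8Test(unittest.TestCase):
        def test_placeholder(self) -> None:
            pass  # the real property test lives in moe/activation_test.py

    if __name__ == "__main__":
        unittest.main()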
2025-05-07T20:33:40.4745616Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:40.4757114Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:40.4759133Z moe/activation_test.py:95: OutOfMemoryError (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0))
2025-05-07T20:33:40.4759463Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:40.4770901Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (28.44 MiB free; 21.61 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated)
2025-05-07T20:33:40.4772960Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:40.4773285Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:40.4783596Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (140.44 MiB free; 21.50 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated)
2025-05-07T20:33:40.4785679Z moe/activation_test.py:92: OutOfMemoryError (x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))
2025-05-07T20:33:40.4785998Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:40.4797121Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated)
2025-05-07T20:33:40.4799113Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:40.4799482Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:40.5950550Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated)
2025-05-07T20:33:40.5952600Z moe/activation_test.py:94: OutOfMemoryError (x_sign = torch.sign(x))
2025-05-07T20:33:40.5952917Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:40.5983799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError -- ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") (same traceback as above)
2025-05-07T20:33:40.5984496Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:40.6693601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError -- same failure
2025-05-07T20:33:40.6694223Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:40.6730733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError -- same failure
2025-05-07T20:33:40.6731356Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:40.7549145Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (26.44 MiB free; 21.69 GiB allocated by PyTorch, 59.18 MiB reserved but unallocated)
2025-05-07T20:33:40.7551187Z moe/activation_test.py:92: OutOfMemoryError
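Note: the OutOfMemoryError sizes match the test's own tensors exactly. x = torch.randn([T, 2 * D], dtype=torch.bfloat16) takes T * 2D * 2 bytes: 448.00 MiB for T=16384, D=7168; 320.00 MiB for T=16384, D=5120; 56.00 MiB for T=2048, D=7168; and each of torch.abs, torch.clamp, and torch.sign materializes another tensor of the same size. With roughly 22 GiB on the device, memory is nearly exhausted after the earlier failing examples, likely because the stored failure tracebacks keep frame references to x, x0, and x1 alive across hypothesis examples, so later examples fail on allocations as small as 40 MiB. The sketch below checks that arithmetic and shows one possible cleanup between examples; the helper names are illustrative, not from the test file, and the allocator hint quoted in the message itself, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, only takes effect if set before the first CUDA allocation:

    import gc
    import os

    # Must be set before the first CUDA allocation in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def activation_input_bytes(T: int, D: int) -> int:
        # One [T, 2*D] bfloat16 tensor at 2 bytes per element.
        return T * 2 * D * 2

    assert activation_input_bytes(16384, 7168) == 448 * 1024**2  # 448.00 MiB
    assert activation_input_bytes(16384, 5120) == 320 * 1024**2  # 320.00 MiB
    assert activation_input_bytes(2048, 7168) == 56 * 1024**2    # 56.00 MiB

    def release_cuda_memory() -> None:
        # Collect unreachable tensors, then return cached but unused blocks
        # to the driver.  This cannot reclaim tensors that are still pinned
        # by live traceback frames, only ones with no remaining references.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()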
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:40.7551063Z 2025-05-07T20:33:40.7551187Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:40.7551409Z 2025-05-07T20:33:40.7551516Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:40.7551935Z self=, 2025-05-07T20:33:40.7552343Z T=1, 2025-05-07T20:33:40.7552540Z D=5120, 2025-05-07T20:33:40.7552747Z scale_ub=1200.0, 2025-05-07T20:33:40.7552979Z contiguous=True, 2025-05-07T20:33:40.7553200Z compiled=False, 2025-05-07T20:33:40.7553419Z ) 2025-05-07T20:33:40.7553744Z self = 2025-05-07T20:33:40.7554237Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:40.7554511Z 2025-05-07T20:33:40.7554597Z @given( 2025-05-07T20:33:40.7554836Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:40.7555218Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:40.7555538Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:40.7555968Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:40.7556301Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:40.7556595Z ) 2025-05-07T20:33:40.7556945Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:40.7557390Z def test_silu_mul_quant( 2025-05-07T20:33:40.7557632Z self, 2025-05-07T20:33:40.7557827Z T: int, 2025-05-07T20:33:40.7558030Z D: int, 2025-05-07T20:33:40.7558254Z scale_ub: Optional[float], 2025-05-07T20:33:40.7558529Z contiguous: bool, 2025-05-07T20:33:40.7558770Z compiled: bool, 2025-05-07T20:33:40.7558990Z ) -> None: 2025-05-07T20:33:40.7559217Z torch.manual_seed(2025) 2025-05-07T20:33:40.7559501Z 2025-05-07T20:33:40.7559842Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:40.7560196Z 2025-05-07T20:33:40.7560400Z x_sign = torch.sign(x) 2025-05-07T20:33:40.7560689Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:40.7561046Z x = x_sign * x_clamp 2025-05-07T20:33:40.7561291Z x0 = x[:, :D] 2025-05-07T20:33:40.7561504Z x1 = x[:, D:] 2025-05-07T20:33:40.7561714Z 2025-05-07T20:33:40.7561907Z if contiguous: 2025-05-07T20:33:40.7562135Z x0 = x0.contiguous() 2025-05-07T20:33:40.7562397Z x1 = x1.contiguous() 2025-05-07T20:33:40.7562641Z 2025-05-07T20:33:40.7562840Z if scale_ub is not None: 2025-05-07T20:33:40.7563114Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:40.7563453Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:40.7563772Z ) 2025-05-07T20:33:40.7563963Z else: 2025-05-07T20:33:40.7564174Z scale_ub_tensor = None 2025-05-07T20:33:40.7564431Z 2025-05-07T20:33:40.7564710Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:40.7565030Z op = silu_mul_quant 2025-05-07T20:33:40.7565302Z if compiled: 2025-05-07T20:33:40.7565801Z op = torch.compile(op) 2025-05-07T20:33:40.7566099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:40.7566376Z 2025-05-07T20:33:40.7566579Z > y_fp8, y_scale = fn() 2025-05-07T20:33:40.7566746Z 2025-05-07T20:33:40.7566850Z moe/activation_test.py:117: 2025-05-07T20:33:40.7567148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:40.7567479Z moe/activation_test.py:115: in fn 2025-05-07T20:33:40.7567771Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:40.7568457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:40.7569148Z 
_fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa169b20>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
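The ValueError above is what recent Triton versions raise when a kernel uses the fp8e4nv (FP8 E4M3) dtype on a GPU whose compute capability is below 8.9; this device only reports support for the fp8e4b15 and fp8e5 encodings. A minimal, illustrative guard (the helper name supports_fp8e4nv is ours, not from the test file) could skip these cases on unsupported hardware instead of failing the suite:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton maps fp8e4nv onto native FP8 E4M3, which first appears at
        # compute capability 8.9 (Ada) / 9.0 (Hopper).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class ActivationFp8Tests(unittest.TestCase):
        ...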
Eleven of the generated examples then failed the same way with torch.OutOfMemoryError before ever reaching the kernel. The test body is identical to the listing above, so only the parameters, the failing statement, and the attempted allocation are summarized here. In each case GPU 0 (22.07 GiB total) had only 26.44 MiB free, with roughly 21.7 GiB already allocated by PyTorch, and each error ended with the standard advice to try PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True and to see the Memory Management documentation (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables).

    T      D     scale_ub  contiguous  compiled  failing statement            tried to allocate
    2048   5120  None      True        False     torch.sign(x)     (line 94)   40.00 MiB
    16384  5120  None      True        False     torch.randn(...)  (line 92)  320.00 MiB
    4096   5120  None      True        False     torch.randn(...)  (line 92)   80.00 MiB
    2048   5120  None      False       False     torch.randn(...)  (line 92)   40.00 MiB
    4096   7168  None      True        True      torch.randn(...)  (line 92)  112.00 MiB
    2048   5120  1200.0    False       False     torch.randn(...)  (line 92)   40.00 MiB
    4096   7168  1200.0    True        False     torch.randn(...)  (line 92)  112.00 MiB
    16384  7168  None      False       True      torch.randn(...)  (line 92)  448.00 MiB
    4096   7168  None      True        False     torch.randn(...)  (line 92)  112.00 MiB
    16384  7168  None      True        False     torch.randn(...)  (line 92)  448.00 MiB
    16384  7168  1200.0    True        False     torch.randn(...)  (line 92)  448.00 MiB

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

(test body identical to the listing above; this example allocated successfully and reached the kernel launch)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.0887680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.0888364Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.0889033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.0889576Z kernel = self.compile( 2025-05-07T20:33:41.0890126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.0890787Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.0891241Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.0891473Z 2025-05-07T20:33:41.0891692Z self = 2025-05-07T20:33:41.0892788Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.0894167Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa2b4860>} 2025-05-07T20:33:41.0895525Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.0896562Z context = 2025-05-07T20:33:41.0896857Z 2025-05-07T20:33:41.0897036Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.0897564Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.0898043Z module_map=module_map) 2025-05-07T20:33:41.0898409Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.0898766Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.0899026Z E ^ 2025-05-07T20:33:41.0899513Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.0899974Z 2025-05-07T20:33:41.0900394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.0900909Z 2025-05-07T20:33:41.0901023Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.0901440Z self=, 2025-05-07T20:33:41.0901923Z T=2048, 2025-05-07T20:33:41.0902119Z D=7168, 2025-05-07T20:33:41.0902327Z scale_ub=None, 2025-05-07T20:33:41.0902551Z contiguous=False, 2025-05-07T20:33:41.0902785Z compiled=False, 2025-05-07T20:33:41.0902995Z ) 2025-05-07T20:33:41.0903313Z self = 2025-05-07T20:33:41.0903834Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:41.0904109Z 2025-05-07T20:33:41.0904199Z @given( 2025-05-07T20:33:41.0904445Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.0904758Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.0905076Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.0905411Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.0905792Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.0906093Z ) 2025-05-07T20:33:41.0906449Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.0906887Z def test_silu_mul_quant( 2025-05-07T20:33:41.0907180Z self, 2025-05-07T20:33:41.0907380Z T: int, 2025-05-07T20:33:41.0907576Z D: int, 2025-05-07T20:33:41.0907796Z scale_ub: Optional[float], 2025-05-07T20:33:41.0908070Z contiguous: bool, 2025-05-07T20:33:41.0908307Z compiled: bool, 2025-05-07T20:33:41.0908536Z ) -> None: 2025-05-07T20:33:41.0908760Z torch.manual_seed(2025) 2025-05-07T20:33:41.0909007Z 2025-05-07T20:33:41.0909280Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.0911438Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.0913315Z 2025-05-07T20:33:41.0913439Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.0913654Z 2025-05-07T20:33:41.0913768Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.0914182Z self=, 2025-05-07T20:33:41.0914595Z T=128, 2025-05-07T20:33:41.0914790Z D=7168, 2025-05-07T20:33:41.0914986Z scale_ub=1200.0, 2025-05-07T20:33:41.0915207Z contiguous=True, 2025-05-07T20:33:41.0915431Z compiled=True, 2025-05-07T20:33:41.0915641Z ) 2025-05-07T20:33:41.1227949Z self = 2025-05-07T20:33:41.1229038Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:41.1229737Z 2025-05-07T20:33:41.1229860Z @given( 2025-05-07T20:33:41.1230111Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.1230442Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.1230757Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.1231101Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.1231444Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.1231738Z ) 2025-05-07T20:33:41.1232098Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.1232553Z def test_silu_mul_quant( 2025-05-07T20:33:41.1232802Z self, 2025-05-07T20:33:41.1232996Z T: int, 2025-05-07T20:33:41.1233204Z D: int, 2025-05-07T20:33:41.1233427Z scale_ub: Optional[float], 2025-05-07T20:33:41.1233700Z contiguous: bool, 2025-05-07T20:33:41.1233950Z compiled: bool, 2025-05-07T20:33:41.1234291Z ) -> None: 2025-05-07T20:33:41.1234508Z torch.manual_seed(2025) 2025-05-07T20:33:41.1234764Z 2025-05-07T20:33:41.1235048Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.1235394Z 2025-05-07T20:33:41.1235598Z x_sign = torch.sign(x) 2025-05-07T20:33:41.1235964Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.1236276Z x = x_sign * x_clamp 2025-05-07T20:33:41.1236520Z x0 = x[:, :D] 2025-05-07T20:33:41.1236740Z x1 = x[:, D:] 2025-05-07T20:33:41.1236951Z 2025-05-07T20:33:41.1237141Z if contiguous: 2025-05-07T20:33:41.1237376Z x0 = x0.contiguous() 2025-05-07T20:33:41.1237633Z x1 = x1.contiguous() 2025-05-07T20:33:41.1237880Z 2025-05-07T20:33:41.1238078Z if scale_ub is not None: 2025-05-07T20:33:41.1238349Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.1238768Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.1239092Z ) 2025-05-07T20:33:41.1239296Z else: 2025-05-07T20:33:41.1239504Z scale_ub_tensor = None 2025-05-07T20:33:41.1239827Z 2025-05-07T20:33:41.1240066Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.1240382Z op = silu_mul_quant 2025-05-07T20:33:41.1240635Z if compiled: 2025-05-07T20:33:41.1240888Z op = torch.compile(op) 2025-05-07T20:33:41.1241183Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.1241462Z 2025-05-07T20:33:41.1241660Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.1241824Z 2025-05-07T20:33:41.1241925Z moe/activation_test.py:117: 2025-05-07T20:33:41.1242222Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.1242558Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.1242843Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.1243464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.1244035Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.1244701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.1245388Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.1245936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.1246618Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.1247289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.1247819Z kernel = self.compile( 2025-05-07T20:33:41.1248365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.1249028Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.1249428Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.1249670Z 2025-05-07T20:33:41.1249879Z self = 2025-05-07T20:33:41.1250970Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.1252349Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa2b59e0>} 2025-05-07T20:33:41.1253699Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.1254777Z context = 2025-05-07T20:33:41.1255073Z 2025-05-07T20:33:41.1255242Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.1255775Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.1256254Z module_map=module_map) 2025-05-07T20:33:41.1256624Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.1256996Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.1257262Z E ^ 2025-05-07T20:33:41.1257723Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
(test body identical to the listing above)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
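A quick cross-check of the failed allocation sizes against the test's input shape: torch.randn([T, 2 * D], dtype=torch.bfloat16) needs T * 2D * 2 bytes, which reproduces the large requests in the table above exactly. The sketch below is ours, not part of the test suite:

    # bf16 input tensor: T x 2D elements, 2 bytes each
    for T, D in [(16384, 7168), (4096, 7168), (16384, 5120), (2048, 5120)]:
        print(T, D, T * 2 * D * 2 / 2**20, "MiB")
    # -> 448.0, 112.0, 320.0, 40.0 MiB, matching the log

For the T=128 examples the tensor itself is only about 3.5 MiB; the reported 20.00 MiB request is plausibly the CUDA caching allocator's minimum large-pool segment size rather than the tensor size.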
Two further examples failed with the same torch.OutOfMemoryError (test body identical to the listing above), by now with only 4.44 MiB of GPU memory free and 21.77 GiB held by PyTorch:

    T    D     scale_ub  contiguous  compiled  failing statement            tried to allocate
    128  5120  1200.0    True        True      torch.clamp(...)  (line 95)  20.00 MiB
    128  7168  None      True        True      torch.randn(...)  (line 92)  20.00 MiB
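Every one of these errors ends with the same hint. PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes, so to try expandable_segments it must be set before the process first touches the GPU, e.g. in the job environment (export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) or, as a sketch, at the very top of the test entry point:

    import os

    # Must run before the first CUDA allocation; it has no effect afterwards.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported only after the allocator config is in place

Note this only mitigates fragmentation; with roughly 21.7 GiB already held by earlier examples in the same process, releasing memory between examples is likely to matter more.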
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
  |     if method() is not None:
  |        ^^^^^^^^
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |     ^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.3863394Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:41.3863994Z | self=, 2025-05-07T20:33:41.3864559Z | T=2048, 2025-05-07T20:33:41.3864878Z | D=5120, # or any other generated value 2025-05-07T20:33:41.3865553Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:41.3866165Z | contiguous=True, # or any other generated value 2025-05-07T20:33:41.3866725Z | compiled=False, # or any other generated value 2025-05-07T20:33:41.3867159Z | ) 2025-05-07T20:33:41.3867414Z | 2025-05-07T20:33:41.3868166Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:41.3868993Z +---------------- 2 ---------------- 2025-05-07T20:33:41.3869388Z | Traceback (most recent call last): 2025-05-07T20:33:41.3870353Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:41.3871430Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.3872065Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:41.3874787Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.3877641Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:41.3878244Z | self=, 2025-05-07T20:33:41.3878801Z | T=128, 2025-05-07T20:33:41.3879103Z | D=7168, 2025-05-07T20:33:41.3879392Z | scale_ub=None, 2025-05-07T20:33:41.3879722Z | contiguous=True, 2025-05-07T20:33:41.3880054Z | compiled=True, 2025-05-07T20:33:41.3880370Z | ) 2025-05-07T20:33:41.3880711Z | 2025-05-07T20:33:41.3881426Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:41.3882283Z +---------------- 3 ---------------- 2025-05-07T20:33:41.3882690Z | Traceback (most recent call last): 2025-05-07T20:33:41.3883656Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:41.3884702Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.3885220Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:41.3887433Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.3889409Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:41.3889843Z | self=, 2025-05-07T20:33:41.3890250Z | T=128, 2025-05-07T20:33:41.3890462Z | D=5120, 2025-05-07T20:33:41.3890678Z | scale_ub=1200.0, 2025-05-07T20:33:41.3890925Z | contiguous=True, 2025-05-07T20:33:41.3891162Z | compiled=True, 2025-05-07T20:33:41.3891390Z | ) 2025-05-07T20:33:41.3891575Z | 2025-05-07T20:33:41.3892098Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:41.3892767Z +---------------- 4 ---------------- 2025-05-07T20:33:41.3893061Z | Traceback (most recent call last): 2025-05-07T20:33:41.3893766Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:41.3894482Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:41.3894772Z | ^^^^^^^^ 2025-05-07T20:33:41.3895413Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:41.3896106Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:41.3896446Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:41.3897315Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:41.3898118Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:41.3898721Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:41.3899504Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.3899953Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:41.3900700Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:41.3901808Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:41.3902478Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:41.3903457Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:41.3904468Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:41.3905024Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:41.3905901Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:41.3912432Z | fn() 2025-05-07T20:33:41.3913269Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:41.3914144Z | self.fn.run( 2025-05-07T20:33:41.3914880Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:41.3915758Z | kernel = self.compile( 2025-05-07T20:33:41.3916136Z | ^^^^^^^^^^^^^ 2025-05-07T20:33:41.3916968Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:41.3917935Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.3918478Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:41.3919369Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:41.3920480Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.3921149Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:41.3921669Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.3922162Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:41.3922536Z | ^ 2025-05-07T20:33:41.3923169Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.3924067Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:41.3924626Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:41.3925348Z | self=, 2025-05-07T20:33:41.3925951Z | T=1, # or any other generated value 2025-05-07T20:33:41.3926394Z | D=5120, # or any other generated value 2025-05-07T20:33:41.3926870Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:41.3927370Z | contiguous=True, # or any other generated value 2025-05-07T20:33:41.3927883Z | compiled=True, # or any other generated value 2025-05-07T20:33:41.3928310Z | ) 2025-05-07T20:33:41.3928554Z | 2025-05-07T20:33:41.3929350Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:41.3930232Z +------------------------------------ 2025-05-07T20:33:41.3930754Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:41.3931381Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.3931945Z self=, 2025-05-07T20:33:41.3932490Z T=1, 2025-05-07T20:33:41.3932757Z D=5120, 2025-05-07T20:33:41.3933032Z scale_ub=None, 2025-05-07T20:33:41.3933336Z contiguous=True, 2025-05-07T20:33:41.3933639Z compiled=True, 2025-05-07T20:33:41.3933935Z ) 2025-05-07T20:33:41.3934377Z self = 2025-05-07T20:33:41.3935027Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:41.3935386Z 2025-05-07T20:33:41.3935500Z @given( 2025-05-07T20:33:41.3935823Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.3936265Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.3936752Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.3937215Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.3937673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.3938067Z ) 2025-05-07T20:33:41.3938551Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.3939163Z def test_silu_mul_quant( 2025-05-07T20:33:41.3939517Z self, 2025-05-07T20:33:41.3939826Z T: int, 2025-05-07T20:33:41.3940107Z D: int, 2025-05-07T20:33:41.3940408Z scale_ub: Optional[float], 2025-05-07T20:33:41.3940793Z contiguous: bool, 2025-05-07T20:33:41.3941136Z compiled: bool, 2025-05-07T20:33:41.3941448Z ) -> None: 2025-05-07T20:33:41.3941750Z torch.manual_seed(2025) 2025-05-07T20:33:41.3942086Z 2025-05-07T20:33:41.3942469Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.3942935Z 2025-05-07T20:33:41.3943214Z x_sign = torch.sign(x) 2025-05-07T20:33:41.3943615Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.3944049Z x = x_sign * x_clamp 2025-05-07T20:33:41.3944396Z x0 = x[:, :D] 2025-05-07T20:33:41.3944700Z x1 = x[:, D:] 2025-05-07T20:33:41.3944983Z 2025-05-07T20:33:41.3945245Z if contiguous: 2025-05-07T20:33:41.3945567Z x0 = x0.contiguous() 2025-05-07T20:33:41.3945925Z x1 = x1.contiguous() 2025-05-07T20:33:41.3946270Z 2025-05-07T20:33:41.3946540Z if scale_ub is not None: 2025-05-07T20:33:41.3946904Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.3947365Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.3947798Z ) 2025-05-07T20:33:41.3948069Z else: 2025-05-07T20:33:41.3948374Z scale_ub_tensor = None 2025-05-07T20:33:41.3948735Z 2025-05-07T20:33:41.3949060Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.3949575Z op = silu_mul_quant 2025-05-07T20:33:41.3949972Z if compiled: 2025-05-07T20:33:41.3950327Z op = torch.compile(op) 2025-05-07T20:33:41.3950737Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.3951129Z 2025-05-07T20:33:41.3951405Z y_fp8, y_scale = fn() 2025-05-07T20:33:41.3951788Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:41.3952193Z 2025-05-07T20:33:41.3952523Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.3952982Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:41.3953378Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:41.3953801Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:41.3954272Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:41.3954687Z 2025-05-07T20:33:41.3955015Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:41.3955278Z 2025-05-07T20:33:41.3955428Z moe/activation_test.py:126: 2025-05-07T20:33:41.3955945Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.3956467Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:41.3956909Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:41.3957977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:41.3959025Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:41.3959844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.3960766Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.3961687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:41.3962724Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:41.3963737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:41.3964607Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:41.3965709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:41.3966418Z fn() 2025-05-07T20:33:41.3967100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:41.3967880Z self.fn.run( 2025-05-07T20:33:41.3968507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.3969225Z kernel = self.compile( 2025-05-07T20:33:41.3969952Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:41.3970825Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:41.3971366Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:41.3971694Z 
2025-05-07T20:33:41.3971984Z self = 
2025-05-07T20:33:41.3973436Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:41.3975325Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2cb2735c60>}
2025-05-07T20:33:41.3977219Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:41.3978793Z context = 
2025-05-07T20:33:41.3979184Z 
2025-05-07T20:33:41.3979421Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:41.3980173Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:41.3980821Z module_map=module_map)
2025-05-07T20:33:41.3981314Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:41.3981801Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:41.3982160Z E ^
2025-05-07T20:33:41.3982791Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:41.3983393Z 
2025-05-07T20:33:41.3984040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:41.3984722Z 
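Two follow-ups suggested by this failure. First, the "You can reproduce this example by temporarily adding @reproduce_failure(...)" lines above can be pasted onto the test to replay a single falsifying example deterministically; a minimal sketch, with the payload copied verbatim from the log and the strategies mirrored from the test source (the standalone function name and elided body are illustrative, not the actual activation_test.py contents):

    # A minimal sketch: replay one falsifying example from this log with Hypothesis.
    from typing import Optional
    from hypothesis import Verbosity, given, reproduce_failure, settings, strategies as st

    @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=')  # payload from the log; remove once fixed
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant_repro(T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool) -> None:
        ...  # same body as test_silu_mul_quant in moe/activation_test.py

Second, the root ValueError is Triton rejecting the fp8e4nv element type (PyTorch's float8_e4m3fn), which Triton typically only lowers on GPUs with compute capability 8.9 (Ada) or 9.0 (Hopper) and newer; on older parts only fp8e5 and fp8e4b15 are available, exactly as the message lists. A capability guard of the following shape would let the suite skip rather than error on such runners; this is a sketch only, and the helper name and the >= (8, 9) threshold are assumptions, not FBGEMM API:

    # A minimal sketch: skip fp8e4nv-dependent tests on GPUs where Triton cannot compile that dtype.
    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        # Assumption: fp8e4nv needs compute capability >= 8.9 (Ada) in this Triton version.
        return torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8e4nv = unittest.skipIf(
        not _supports_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU"
    )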
2025-05-07T20:33:41.3984870Z The remaining Hypothesis examples all fail with this same root cause, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"): examples with compiled=False fail inside fn() (moe/activation_test.py:117) while compiling _fbgemm_silu_mul_quant, and examples with compiled=True fail inside ref_fn() (moe/activation_test.py:126) while compiling _kernel_quantize_fp8_row. The surrounding make_ir()/CUDAOptions frames are identical to the one shown above apart from the autotuner's num_stages (3 on the fn() path, 2 on the ref_fn() path) and object addresses.
2025-05-07T20:33:41.3985436Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError in fn() at _fbgemm_silu_mul_quant
2025-05-07T20:33:41.4048448Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> CompilationError in ref_fn() at _kernel_quantize_fp8_row
2025-05-07T20:33:41.4092753Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError in fn() at _fbgemm_silu_mul_quant
2025-05-07T20:33:41.4124242Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in ref_fn() at _kernel_quantize_fp8_row
2025-05-07T20:33:41.4140637Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in fn() at _fbgemm_silu_mul_quant
2025-05-07T20:33:41.4153709Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in fn() at _fbgemm_silu_mul_quant
2025-05-07T20:33:41.4167314Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in ref_fn() at _kernel_quantize_fp8_row
2025-05-07T20:33:41.4190753Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in fn() at _fbgemm_silu_mul_quant:
2025-05-07T20:33:41.4203315Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:41.4203423Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:41.4203502Z E ^
2025-05-07T20:33:41.4203866Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4203924Z 2025-05-07T20:33:41.4204339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4204346Z 2025-05-07T20:33:41.4204452Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4204682Z self=, 2025-05-07T20:33:41.4204762Z T=4096, 2025-05-07T20:33:41.4204845Z D=5120, 2025-05-07T20:33:41.4204937Z scale_ub=1200.0, 2025-05-07T20:33:41.4205023Z contiguous=True, 2025-05-07T20:33:41.4205110Z compiled=False, 2025-05-07T20:33:41.4205191Z ) 2025-05-07T20:33:41.4205412Z self = 2025-05-07T20:33:41.4205597Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:41.4205601Z 2025-05-07T20:33:41.4205685Z @given( 2025-05-07T20:33:41.4205852Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4205965Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4206086Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4206246Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4206368Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4206444Z ) 2025-05-07T20:33:41.4206691Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4206795Z def test_silu_mul_quant( 2025-05-07T20:33:41.4206875Z self, 2025-05-07T20:33:41.4206961Z T: int, 2025-05-07T20:33:41.4207039Z D: int, 2025-05-07T20:33:41.4207140Z scale_ub: Optional[float], 2025-05-07T20:33:41.4207239Z contiguous: bool, 2025-05-07T20:33:41.4207327Z compiled: bool, 2025-05-07T20:33:41.4207408Z ) -> None: 2025-05-07T20:33:41.4207508Z torch.manual_seed(2025) 2025-05-07T20:33:41.4207586Z 2025-05-07T20:33:41.4207803Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4207889Z 2025-05-07T20:33:41.4207984Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4208113Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4208209Z x = x_sign * x_clamp 2025-05-07T20:33:41.4208292Z x0 = x[:, :D] 2025-05-07T20:33:41.4208378Z x1 = x[:, D:] 2025-05-07T20:33:41.4208452Z 2025-05-07T20:33:41.4208538Z if contiguous: 2025-05-07T20:33:41.4208636Z x0 = x0.contiguous() 2025-05-07T20:33:41.4208726Z x1 = x1.contiguous() 2025-05-07T20:33:41.4208801Z 2025-05-07T20:33:41.4208897Z if scale_ub is not None: 2025-05-07T20:33:41.4209005Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4209143Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4209227Z ) 2025-05-07T20:33:41.4209306Z else: 2025-05-07T20:33:41.4209410Z scale_ub_tensor = None 2025-05-07T20:33:41.4209496Z 2025-05-07T20:33:41.4209657Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4209761Z op = silu_mul_quant 2025-05-07T20:33:41.4209867Z if compiled: 2025-05-07T20:33:41.4209970Z op = torch.compile(op) 2025-05-07T20:33:41.4210087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4210164Z 2025-05-07T20:33:41.4210259Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4210264Z 2025-05-07T20:33:41.4210371Z moe/activation_test.py:117: 2025-05-07T20:33:41.4210502Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4210605Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4210715Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4211214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4211322Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4211731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4211958Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4212300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4212399Z kernel = self.compile( 2025-05-07T20:33:41.4212781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4212962Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4213093Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4213097Z 2025-05-07T20:33:41.4213309Z self = 2025-05-07T20:33:41.4214129Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4214675Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2cb0cb1b20>} 2025-05-07T20:33:41.4215426Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4215617Z context = 2025-05-07T20:33:41.4215621Z 2025-05-07T20:33:41.4215791Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4216055Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4216171Z module_map=module_map) 2025-05-07T20:33:41.4216403Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4216506Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4216594Z E ^ 2025-05-07T20:33:41.4216951Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4216956Z 2025-05-07T20:33:41.4217368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4217373Z 2025-05-07T20:33:41.4217489Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4217716Z self=, 2025-05-07T20:33:41.4217803Z T=1, 2025-05-07T20:33:41.4217881Z D=5120, 2025-05-07T20:33:41.4217966Z scale_ub=None, 2025-05-07T20:33:41.4218059Z contiguous=True, 2025-05-07T20:33:41.4218148Z compiled=True, 2025-05-07T20:33:41.4218224Z ) 2025-05-07T20:33:41.4218456Z self = 2025-05-07T20:33:41.4218618Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:41.4218625Z 2025-05-07T20:33:41.4218705Z @given( 2025-05-07T20:33:41.4218829Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4218928Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4219049Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4219168Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4219283Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4219365Z ) 2025-05-07T20:33:41.4219610Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4219706Z def test_silu_mul_quant( 2025-05-07T20:33:41.4219793Z self, 2025-05-07T20:33:41.4219872Z T: int, 2025-05-07T20:33:41.4219954Z D: int, 2025-05-07T20:33:41.4220106Z scale_ub: Optional[float], 2025-05-07T20:33:41.4220202Z contiguous: bool, 2025-05-07T20:33:41.4220292Z compiled: bool, 2025-05-07T20:33:41.4220379Z ) -> None: 2025-05-07T20:33:41.4220479Z torch.manual_seed(2025) 2025-05-07T20:33:41.4220558Z 2025-05-07T20:33:41.4220729Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4220805Z 2025-05-07T20:33:41.4220902Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4221028Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4221120Z x = x_sign * x_clamp 2025-05-07T20:33:41.4221209Z x0 = x[:, :D] 2025-05-07T20:33:41.4221291Z x1 = x[:, D:] 2025-05-07T20:33:41.4221366Z 2025-05-07T20:33:41.4221456Z if contiguous: 2025-05-07T20:33:41.4221549Z x0 = x0.contiguous() 2025-05-07T20:33:41.4221640Z x1 = x1.contiguous() 2025-05-07T20:33:41.4221720Z 2025-05-07T20:33:41.4221858Z if scale_ub is not None: 2025-05-07T20:33:41.4221976Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4222112Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4222230Z ) 2025-05-07T20:33:41.4222313Z else: 2025-05-07T20:33:41.4222410Z scale_ub_tensor = None 2025-05-07T20:33:41.4222484Z 2025-05-07T20:33:41.4222618Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4222714Z op = silu_mul_quant 2025-05-07T20:33:41.4222803Z if compiled: 2025-05-07T20:33:41.4222911Z op = torch.compile(op) 2025-05-07T20:33:41.4223018Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4223095Z 2025-05-07T20:33:41.4223192Z y_fp8, y_scale = fn() 2025-05-07T20:33:41.4223314Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:41.4223396Z 2025-05-07T20:33:41.4223537Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4223646Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:41.4223795Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:41.4223919Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:41.4224064Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:41.4224145Z 2025-05-07T20:33:41.4224247Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:41.4224251Z 2025-05-07T20:33:41.4224350Z moe/activation_test.py:126: 2025-05-07T20:33:41.4224485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4224596Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:41.4224735Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:41.4225293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:41.4225399Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:41.4225770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4225994Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4226365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:41.4226632Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:41.4227009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:41.4227181Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:41.4227523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:41.4227601Z fn() 2025-05-07T20:33:41.4228011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:41.4228139Z self.fn.run( 2025-05-07T20:33:41.4228485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4228583Z kernel = self.compile( 2025-05-07T20:33:41.4228962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4229143Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4229275Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4229279Z 2025-05-07T20:33:41.4229485Z self = 2025-05-07T20:33:41.4230355Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4230866Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2cb0cb2a20>} 2025-05-07T20:33:41.4231655Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4231849Z context = 2025-05-07T20:33:41.4231854Z 2025-05-07T20:33:41.4232026Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4232290Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4232398Z module_map=module_map) 2025-05-07T20:33:41.4232564Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4232672Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:41.4232795Z E ^ 2025-05-07T20:33:41.4233158Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:41.4233578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:41.4233693Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:41.4249992Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:41.4266475Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:41.4282774Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:41.4289593Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:41.4289700Z moe/activation_test.py:126:
2025-05-07T20:33:41.4289949Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:41.4290123Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:41.4290686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:41.4290792Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:41.4297885Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:41.4297991Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:33:41.4298071Z E   ^
2025-05-07T20:33:41.4298428Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:41.4298843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
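Every ref_fn() example above fails in the same spot: triton_quantize_fp8_row launches the _kernel_quantize_fp8_row Triton kernel, whose output dtype maps to Triton's fp8e4nv (e4m3), and that format cannot be compiled on this GPU architecture. For reasoning about what never gets to run, a rough eager-mode sketch of the row-wise quantization follows, assuming the scale convention implied by the test's own dequant step (y_fp8.to(torch.float32) * y_scale[:, None]); the function name and scale_ub handling are illustrative, not FBGEMM's implementation:

from typing import Optional, Tuple

import torch

def rowwise_fp8_quant_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row max-abs (optionally capped by scale_ub) sets the dequant
    # scale so that y ~= y_fp8.to(torch.float32) * scale[:, None].
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3
    row_max = y.abs().amax(dim=1).to(torch.float32).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / fp8_max
    y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale

A sketch like this runs even on this machine, since the eager float8_e4m3fn cast does not go through Triton; only the fused kernels fail.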
2025-05-07T20:33:41.4298955Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:41.4311833Z > y_fp8, y_scale = fn()
2025-05-07T20:33:41.4311945Z moe/activation_test.py:117:
2025-05-07T20:33:41.4312228Z moe/activation_test.py:115: in fn
2025-05-07T20:33:41.4312328Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:41.4312709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:41.4312804Z     return fn(*args, **kwargs)
2025-05-07T20:33:41.4313299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:41.4313398Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:41.4318392Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:41.4318495Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:41.4318576Z E   ^
2025-05-07T20:33:41.4318933Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:41.4319431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
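The compiled variants add only the torch/_dynamo/eval_frame.py frame before re-entering the eager op, so every parameter combination hits the same compile-time error and no numerics are exercised. One way to keep the suite green on such hardware is to gate the test class on compute capability; a minimal sketch, assuming the sm_89 threshold at which Triton's fp8e4nv (e4m3) kernels are known to compile (verify against the Triton release in use; the class name is illustrative):

import unittest

import torch

def gpu_supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (e4m3) kernels need sm_89 or newer (Ada/Hopper);
    # on older parts only fp8e4b15 and fp8e5 compile, per the error above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
class SiluMulQuantTest(unittest.TestCase):  # illustrative name
    ...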
2025-05-07T20:33:41.4319541Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:41.4326184Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:41.4326291Z moe/activation_test.py:126:
2025-05-07T20:33:41.4326535Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:41.4326669Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:41.4327230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:41.4327342Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:41.4334477Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:41.4334585Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:33:41.4334664Z E   ^
2025-05-07T20:33:41.4335021Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:41.4335485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:41.4335594Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:41.4341162Z > y_fp8, y_scale = fn()
2025-05-07T20:33:41.4341262Z moe/activation_test.py:117:
2025-05-07T20:33:41.4341499Z moe/activation_test.py:115: in fn
2025-05-07T20:33:41.4341605Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:41.4342100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:41.4342200Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:41.4347191Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:41.4347356Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:41.4347437Z E   ^
2025-05-07T20:33:41.4347798Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:41.4348213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
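To confirm the mismatch on a given runner without the test suite, a quick probe of the device's compute capability suffices (the values in the comment are examples, not taken from this log):

import torch

# Prints the device name and (major, minor) compute capability of device 0.
# Triton's fp8e4nv path needs (8, 9) or newer, e.g. L4/L40S/H100, while
# Ampere parts report (8, 0) or (8, 6).
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))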
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4347803Z 2025-05-07T20:33:41.4348213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4348218Z 2025-05-07T20:33:41.4348328Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4348551Z self=, 2025-05-07T20:33:41.4348630Z T=128, 2025-05-07T20:33:41.4348713Z D=5120, 2025-05-07T20:33:41.4348798Z scale_ub=None, 2025-05-07T20:33:41.4348885Z contiguous=False, 2025-05-07T20:33:41.4348972Z compiled=True, 2025-05-07T20:33:41.4349047Z ) 2025-05-07T20:33:41.4349268Z self = 2025-05-07T20:33:41.4349452Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:41.4349456Z 2025-05-07T20:33:41.4349540Z @given( 2025-05-07T20:33:41.4349663Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4349765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4349879Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4349998Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4350111Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4350186Z ) 2025-05-07T20:33:41.4350431Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4350528Z def test_silu_mul_quant( 2025-05-07T20:33:41.4350608Z self, 2025-05-07T20:33:41.4350688Z T: int, 2025-05-07T20:33:41.4350767Z D: int, 2025-05-07T20:33:41.4350875Z scale_ub: Optional[float], 2025-05-07T20:33:41.4350965Z contiguous: bool, 2025-05-07T20:33:41.4351103Z compiled: bool, 2025-05-07T20:33:41.4351186Z ) -> None: 2025-05-07T20:33:41.4351282Z torch.manual_seed(2025) 2025-05-07T20:33:41.4351360Z 2025-05-07T20:33:41.4351533Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4351613Z 2025-05-07T20:33:41.4351705Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4351834Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4351923Z x = x_sign * x_clamp 2025-05-07T20:33:41.4352007Z x0 = x[:, :D] 2025-05-07T20:33:41.4352092Z x1 = x[:, D:] 2025-05-07T20:33:41.4352168Z 2025-05-07T20:33:41.4352258Z if contiguous: 2025-05-07T20:33:41.4352351Z x0 = x0.contiguous() 2025-05-07T20:33:41.4352442Z x1 = x1.contiguous() 2025-05-07T20:33:41.4352518Z 2025-05-07T20:33:41.4352611Z if scale_ub is not None: 2025-05-07T20:33:41.4352766Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4352909Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4352987Z ) 2025-05-07T20:33:41.4353107Z else: 2025-05-07T20:33:41.4353206Z scale_ub_tensor = None 2025-05-07T20:33:41.4353280Z 2025-05-07T20:33:41.4353408Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4353503Z op = silu_mul_quant 2025-05-07T20:33:41.4353590Z if compiled: 2025-05-07T20:33:41.4353693Z op = torch.compile(op) 2025-05-07T20:33:41.4353802Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4353878Z 2025-05-07T20:33:41.4353975Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4353980Z 2025-05-07T20:33:41.4354078Z moe/activation_test.py:117: 2025-05-07T20:33:41.4354207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4354315Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4354423Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4354832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4354937Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4355428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4355529Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4355939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4356160Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4356503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4356597Z kernel = self.compile( 2025-05-07T20:33:41.4356981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4357161Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4357288Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4357295Z 2025-05-07T20:33:41.4357507Z self = 2025-05-07T20:33:41.4358283Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4358790Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfbe5f100>} 2025-05-07T20:33:41.4359540Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4359778Z context = 2025-05-07T20:33:41.4359785Z 2025-05-07T20:33:41.4359954Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4360215Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4360332Z module_map=module_map) 2025-05-07T20:33:41.4360493Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4360592Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4360677Z E ^ 2025-05-07T20:33:41.4361032Z E ValueError("type fp8e4nv not supported in this architecture. 
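Every replayed example fails the same way: the Triton compiler rejects the fp8e4nv (e4m3) type at kernel-compile time because the job ran on a g5 runner (A10G, compute capability 8.6), while fp8e4nv requires SM 8.9 or newer. A minimal guard sketch, not part of the test file: the helper name, the skip wiring, and the (8, 9) threshold are assumptions inferred from the error above, not FBGEMM's own gating.

```python
# Sketch: skip fp8 tests on GPUs that predate fp8e4nv support.
# Assumption: fp8e4nv (e4m3) needs SM 8.9+ (Ada/Hopper); g5 runners
# carry A10G GPUs at SM 8.6, which matches the ValueError in this log.
import unittest

import torch


def _supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)


# Hypothetical decorator for tests like test_silu_mul_quant above.
requires_fp8 = unittest.skipIf(
    not _supports_fp8e4nv(), "fp8e4nv requires SM 8.9+"
)
```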
Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

        ...
        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
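This example is the one variant in the replay: `fn()` itself completed, and the failure surfaced in the reference path instead, where `triton_quantize_fp8_row` launches the autotuned `_kernel_quantize_fp8_row` kernel and hits the same fp8e4nv rejection. For intuition only, a pure-PyTorch sketch of row-wise fp8 quantization of the kind this reference path performs; the function name, the epsilon clamp, and the exact scale handling are assumptions, not FBGEMM's implementation.

```python
# Sketch of per-row fp8 quantization (assumption: mirrors the shape of
# triton_quantize_fp8_row: one scale per row, optional scale upper bound).
import torch


def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)  # avoid divide-by-zero
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap the row scale
    y_scale = row_max / fp8_max                     # one scale per row
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale
```

Dequantization is then what the test itself does: `y ≈ y_fp8.to(torch.float32) * y_scale[:, None]`.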
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4445493Z 2025-05-07T20:33:41.4445902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4445911Z 2025-05-07T20:33:41.4446058Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4446288Z self=, 2025-05-07T20:33:41.4446367Z T=1, 2025-05-07T20:33:41.4446484Z D=5120, 2025-05-07T20:33:41.4446566Z scale_ub=1200.0, 2025-05-07T20:33:41.4446654Z contiguous=False, 2025-05-07T20:33:41.4446736Z compiled=True, 2025-05-07T20:33:41.4446810Z ) 2025-05-07T20:33:41.4447031Z self = 2025-05-07T20:33:41.4447196Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:41.4447201Z 2025-05-07T20:33:41.4447280Z @given( 2025-05-07T20:33:41.4447399Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4447499Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4447615Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4447730Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4447847Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4447970Z ) 2025-05-07T20:33:41.4448215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4448313Z def test_silu_mul_quant( 2025-05-07T20:33:41.4448391Z self, 2025-05-07T20:33:41.4448470Z T: int, 2025-05-07T20:33:41.4448546Z D: int, 2025-05-07T20:33:41.4448648Z scale_ub: Optional[float], 2025-05-07T20:33:41.4448737Z contiguous: bool, 2025-05-07T20:33:41.4448824Z compiled: bool, 2025-05-07T20:33:41.4448902Z ) -> None: 2025-05-07T20:33:41.4448997Z torch.manual_seed(2025) 2025-05-07T20:33:41.4449072Z 2025-05-07T20:33:41.4449240Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4449315Z 2025-05-07T20:33:41.4449410Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4449537Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4449629Z x = x_sign * x_clamp 2025-05-07T20:33:41.4449715Z x0 = x[:, :D] 2025-05-07T20:33:41.4449799Z x1 = x[:, D:] 2025-05-07T20:33:41.4449872Z 2025-05-07T20:33:41.4449959Z if contiguous: 2025-05-07T20:33:41.4450053Z x0 = x0.contiguous() 2025-05-07T20:33:41.4450145Z x1 = x1.contiguous() 2025-05-07T20:33:41.4450220Z 2025-05-07T20:33:41.4450309Z if scale_ub is not None: 2025-05-07T20:33:41.4450420Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4450557Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4450632Z ) 2025-05-07T20:33:41.4450711Z else: 2025-05-07T20:33:41.4450804Z scale_ub_tensor = None 2025-05-07T20:33:41.4450877Z 2025-05-07T20:33:41.4451009Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4451098Z op = silu_mul_quant 2025-05-07T20:33:41.4451184Z if compiled: 2025-05-07T20:33:41.4451290Z op = torch.compile(op) 2025-05-07T20:33:41.4451444Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4451523Z 2025-05-07T20:33:41.4451615Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4451622Z 2025-05-07T20:33:41.4451718Z moe/activation_test.py:117: 2025-05-07T20:33:41.4451852Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4451953Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4452053Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4452429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4452521Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4453015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4453112Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4453509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4453739Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4454119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4454213Z kernel = self.compile( 2025-05-07T20:33:41.4454592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4454766Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4454904Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4454908Z 2025-05-07T20:33:41.4455112Z self = 2025-05-07T20:33:41.4455930Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4456443Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfadd5ee0>} 2025-05-07T20:33:41.4457188Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4457384Z context = 2025-05-07T20:33:41.4457388Z 2025-05-07T20:33:41.4457556Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4457816Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4457928Z module_map=module_map) 2025-05-07T20:33:41.4458092Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4458202Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4458282Z E ^ 2025-05-07T20:33:41.4458636Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4458643Z 2025-05-07T20:33:41.4459058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4459062Z 2025-05-07T20:33:41.4459166Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4459390Z self=, 2025-05-07T20:33:41.4459472Z T=1, 2025-05-07T20:33:41.4459553Z D=5120, 2025-05-07T20:33:41.4459643Z scale_ub=1200.0, 2025-05-07T20:33:41.4459730Z contiguous=False, 2025-05-07T20:33:41.4459815Z compiled=False, 2025-05-07T20:33:41.4459893Z ) 2025-05-07T20:33:41.4460116Z self = 2025-05-07T20:33:41.4460334Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:41.4460338Z 2025-05-07T20:33:41.4460421Z @given( 2025-05-07T20:33:41.4460542Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4460646Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4460761Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4460878Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4460993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4461068Z ) 2025-05-07T20:33:41.4461309Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4461406Z def test_silu_mul_quant( 2025-05-07T20:33:41.4461483Z self, 2025-05-07T20:33:41.4461563Z T: int, 2025-05-07T20:33:41.4461645Z D: int, 2025-05-07T20:33:41.4461745Z scale_ub: Optional[float], 2025-05-07T20:33:41.4461877Z contiguous: bool, 2025-05-07T20:33:41.4461972Z compiled: bool, 2025-05-07T20:33:41.4462050Z ) -> None: 2025-05-07T20:33:41.4462148Z torch.manual_seed(2025) 2025-05-07T20:33:41.4462264Z 2025-05-07T20:33:41.4462431Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4462510Z 2025-05-07T20:33:41.4462601Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4462725Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4462821Z x = x_sign * x_clamp 2025-05-07T20:33:41.4462900Z x0 = x[:, :D] 2025-05-07T20:33:41.4462980Z x1 = x[:, D:] 2025-05-07T20:33:41.4463056Z 2025-05-07T20:33:41.4463138Z if contiguous: 2025-05-07T20:33:41.4463230Z x0 = x0.contiguous() 2025-05-07T20:33:41.4463321Z x1 = x1.contiguous() 2025-05-07T20:33:41.4463393Z 2025-05-07T20:33:41.4463485Z if scale_ub is not None: 2025-05-07T20:33:41.4463592Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4463775Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4463857Z ) 2025-05-07T20:33:41.4463935Z else: 2025-05-07T20:33:41.4464033Z scale_ub_tensor = None 2025-05-07T20:33:41.4464110Z 2025-05-07T20:33:41.4464242Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4464334Z op = silu_mul_quant 2025-05-07T20:33:41.4464421Z if compiled: 2025-05-07T20:33:41.4464518Z op = torch.compile(op) 2025-05-07T20:33:41.4464623Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4464705Z 2025-05-07T20:33:41.4464796Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4464801Z 2025-05-07T20:33:41.4464898Z moe/activation_test.py:117: 2025-05-07T20:33:41.4465027Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4465127Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4465233Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4466002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4466111Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4466475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4466696Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4467038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4467131Z kernel = self.compile( 2025-05-07T20:33:41.4467509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4467686Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4467816Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4467911Z 2025-05-07T20:33:41.4468123Z self = 2025-05-07T20:33:41.4468900Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4469403Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfadd6b60>} 2025-05-07T20:33:41.4470151Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4470340Z context = 2025-05-07T20:33:41.4470406Z 2025-05-07T20:33:41.4470577Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4470840Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4471006Z module_map=module_map) 2025-05-07T20:33:41.4471171Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4471270Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4471346Z E ^ 2025-05-07T20:33:41.4471702Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4471706Z 2025-05-07T20:33:41.4472116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4472120Z 2025-05-07T20:33:41.4472227Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4472454Z self=, 2025-05-07T20:33:41.4472536Z T=16384, 2025-05-07T20:33:41.4472704Z D=5120, 2025-05-07T20:33:41.4472788Z scale_ub=1200.0, 2025-05-07T20:33:41.4472876Z contiguous=False, 2025-05-07T20:33:41.4472961Z compiled=True, 2025-05-07T20:33:41.4473034Z ) 2025-05-07T20:33:41.4473256Z self = 2025-05-07T20:33:41.4473435Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:41.4473439Z 2025-05-07T20:33:41.4473515Z @given( 2025-05-07T20:33:41.4473636Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4473737Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4473850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4473969Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4474080Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4474156Z ) 2025-05-07T20:33:41.4474401Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4474500Z def test_silu_mul_quant( 2025-05-07T20:33:41.4474578Z self, 2025-05-07T20:33:41.4474658Z T: int, 2025-05-07T20:33:41.4474734Z D: int, 2025-05-07T20:33:41.4474837Z scale_ub: Optional[float], 2025-05-07T20:33:41.4474924Z contiguous: bool, 2025-05-07T20:33:41.4475010Z compiled: bool, 2025-05-07T20:33:41.4475093Z ) -> None: 2025-05-07T20:33:41.4475188Z torch.manual_seed(2025) 2025-05-07T20:33:41.4475262Z 2025-05-07T20:33:41.4475432Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4475506Z 2025-05-07T20:33:41.4475596Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4475824Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4475913Z x = x_sign * x_clamp 2025-05-07T20:33:41.4475998Z x0 = x[:, :D] 2025-05-07T20:33:41.4476078Z x1 = x[:, D:] 2025-05-07T20:33:41.4476154Z 2025-05-07T20:33:41.4476294Z if contiguous: 2025-05-07T20:33:41.4476387Z x0 = x0.contiguous() 2025-05-07T20:33:41.4476477Z x1 = x1.contiguous() 2025-05-07T20:33:41.4476555Z 2025-05-07T20:33:41.4476646Z if scale_ub is not None: 2025-05-07T20:33:41.4476752Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4476888Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4476964Z ) 2025-05-07T20:33:41.4477040Z else: 2025-05-07T20:33:41.4477135Z scale_ub_tensor = None 2025-05-07T20:33:41.4477207Z 2025-05-07T20:33:41.4477338Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4477428Z op = silu_mul_quant 2025-05-07T20:33:41.4477512Z if compiled: 2025-05-07T20:33:41.4477614Z op = torch.compile(op) 2025-05-07T20:33:41.4477720Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4477836Z 2025-05-07T20:33:41.4477933Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4477942Z 2025-05-07T20:33:41.4478038Z moe/activation_test.py:117: 2025-05-07T20:33:41.4478167Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4478314Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4478415Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4478785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4478878Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4479368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[... remainder of this traceback is identical to the full record below ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f2bfb4c8f40>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
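Annotation: every one of these examples fails for the same environmental reason, not because of the values Hypothesis picked. Triton's fp8e4nv is the FP8 E4M3 encoding that torch.float8_e4m3fn maps to, and Triton only lowers it on NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper); on an older part such as the A10G (SM 8.6) that this g5 runner is assumed to use, the compiler offers only ('fp8e4b15', 'fp8e5') and raises exactly this ValueError. A minimal sketch of a test-side guard that skips instead of failing, assuming pytest; require_fp8e4nv is a hypothetical helper, not anything present in activation_test.py:

import pytest
import torch

def require_fp8e4nv() -> None:
    # Hypothetical guard: fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+.
    if not torch.cuda.is_available():
        pytest.skip("CUDA device required")
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) < (8, 9):
        pytest.skip(f"fp8e4nv requires compute capability >= 8.9, got {major}.{minor}")

Calling such a helper at the top of test_silu_mul_quant would collapse the repeated CompilationErrors below into a single skipped test on this runner.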
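Annotation: a minimal standalone reproducer for one failing example, useful for bisecting outside Hypothesis. This is a sketch, assuming the module path shown in the traceback (fbgemm_gpu.experimental.gen_ai.moe.activation) and the call signature used by the test; the parameter values are copied from the failing example above. On an SM 8.9+ GPU it should return the FP8 output and its scale rather than raise:

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 2048, 7168
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D], x[:, D:]  # non-contiguous views, matching contiguous=False
scale_ub_tensor = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

# compiled=True in the failing example; drop torch.compile for the eager path.
y_fp8, y_scale = torch.compile(silu_mul_quant)(x0, x1, scale_ub_tensor)
print(y_fp8.dtype, y_scale.dtype)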
[... Hypothesis prints the full test body and an identical traceback for each example that follows; those repeats are collapsed here to the parameter combinations actually tried. Every example fails in make_ir with the same CompilationError, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); the only variation is that the torch/_dynamo/eval_frame.py frame appears in the compiled=True runs only. ...]

Trying example: test_silu_mul_quant(self=<...>, T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False)

Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
[... test body and traceback identical to the full record above ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4643517Z 2025-05-07T20:33:41.4643928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4643933Z 2025-05-07T20:33:41.4644034Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4644254Z self=, 2025-05-07T20:33:41.4644335Z T=4096, 2025-05-07T20:33:41.4644411Z D=7168, 2025-05-07T20:33:41.4644491Z scale_ub=None, 2025-05-07T20:33:41.4644578Z contiguous=False, 2025-05-07T20:33:41.4644664Z compiled=True, 2025-05-07T20:33:41.4644738Z ) 2025-05-07T20:33:41.4644998Z self = 2025-05-07T20:33:41.4645172Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:41.4645179Z 2025-05-07T20:33:41.4645258Z @given( 2025-05-07T20:33:41.4645376Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4645473Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4645588Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4645706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4645819Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4645897Z ) 2025-05-07T20:33:41.4646135Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4646227Z def test_silu_mul_quant( 2025-05-07T20:33:41.4646306Z self, 2025-05-07T20:33:41.4646381Z T: int, 2025-05-07T20:33:41.4646461Z D: int, 2025-05-07T20:33:41.4646561Z scale_ub: Optional[float], 2025-05-07T20:33:41.4646650Z contiguous: bool, 2025-05-07T20:33:41.4646737Z compiled: bool, 2025-05-07T20:33:41.4646818Z ) -> None: 2025-05-07T20:33:41.4646910Z torch.manual_seed(2025) 2025-05-07T20:33:41.4646985Z 2025-05-07T20:33:41.4647149Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4647221Z 2025-05-07T20:33:41.4647314Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4647435Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4647524Z x = x_sign * x_clamp 2025-05-07T20:33:41.4647605Z x0 = x[:, :D] 2025-05-07T20:33:41.4647684Z x1 = x[:, D:] 2025-05-07T20:33:41.4647759Z 2025-05-07T20:33:41.4647840Z if contiguous: 2025-05-07T20:33:41.4647930Z x0 = x0.contiguous() 2025-05-07T20:33:41.4648019Z x1 = x1.contiguous() 2025-05-07T20:33:41.4648091Z 2025-05-07T20:33:41.4648183Z if scale_ub is not None: 2025-05-07T20:33:41.4648336Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4648467Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4648543Z ) 2025-05-07T20:33:41.4648619Z else: 2025-05-07T20:33:41.4648711Z scale_ub_tensor = None 2025-05-07T20:33:41.4648783Z 2025-05-07T20:33:41.4648913Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4649001Z op = silu_mul_quant 2025-05-07T20:33:41.4649083Z if compiled: 2025-05-07T20:33:41.4649186Z op = torch.compile(op) 2025-05-07T20:33:41.4649289Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4649363Z 2025-05-07T20:33:41.4649453Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4649457Z 2025-05-07T20:33:41.4649551Z moe/activation_test.py:117: 2025-05-07T20:33:41.4649727Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4649831Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4649932Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4650300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4650433Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4650924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4651021Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4651376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4651597Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4651931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4652026Z kernel = self.compile( 2025-05-07T20:33:41.4652452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4652628Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4652759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4652763Z 2025-05-07T20:33:41.4652965Z self = 2025-05-07T20:33:41.4653737Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4654241Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfac814e0>} 2025-05-07T20:33:41.4654986Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4655183Z context = 2025-05-07T20:33:41.4655187Z 2025-05-07T20:33:41.4655349Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4655612Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4655718Z module_map=module_map) 2025-05-07T20:33:41.4655876Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4655980Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4656056Z E ^ 2025-05-07T20:33:41.4656407Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4656412Z 2025-05-07T20:33:41.4656831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4656877Z 2025-05-07T20:33:41.4656979Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4657206Z self=, 2025-05-07T20:33:41.4657282Z T=16384, 2025-05-07T20:33:41.4657359Z D=5120, 2025-05-07T20:33:41.4657443Z scale_ub=1200.0, 2025-05-07T20:33:41.4657528Z contiguous=False, 2025-05-07T20:33:41.4657611Z compiled=False, 2025-05-07T20:33:41.4657691Z ) 2025-05-07T20:33:41.4657905Z self = 2025-05-07T20:33:41.4658083Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:41.4658092Z 2025-05-07T20:33:41.4658168Z @given( 2025-05-07T20:33:41.4658284Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4658387Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4658566Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4658687Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4658803Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4658918Z ) 2025-05-07T20:33:41.4659158Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4659254Z def test_silu_mul_quant( 2025-05-07T20:33:41.4659329Z self, 2025-05-07T20:33:41.4659406Z T: int, 2025-05-07T20:33:41.4659485Z D: int, 2025-05-07T20:33:41.4659582Z scale_ub: Optional[float], 2025-05-07T20:33:41.4659672Z contiguous: bool, 2025-05-07T20:33:41.4659755Z compiled: bool, 2025-05-07T20:33:41.4659831Z ) -> None: 2025-05-07T20:33:41.4659925Z torch.manual_seed(2025) 2025-05-07T20:33:41.4659995Z 2025-05-07T20:33:41.4660160Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4660239Z 2025-05-07T20:33:41.4660331Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4660497Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4660590Z x = x_sign * x_clamp 2025-05-07T20:33:41.4660673Z x0 = x[:, :D] 2025-05-07T20:33:41.4660752Z x1 = x[:, D:] 2025-05-07T20:33:41.4660825Z 2025-05-07T20:33:41.4660906Z if contiguous: 2025-05-07T20:33:41.4660996Z x0 = x0.contiguous() 2025-05-07T20:33:41.4661086Z x1 = x1.contiguous() 2025-05-07T20:33:41.4661157Z 2025-05-07T20:33:41.4661248Z if scale_ub is not None: 2025-05-07T20:33:41.4661350Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4661482Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4661559Z ) 2025-05-07T20:33:41.4661633Z else: 2025-05-07T20:33:41.4661725Z scale_ub_tensor = None 2025-05-07T20:33:41.4661800Z 2025-05-07T20:33:41.4661929Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4662019Z op = silu_mul_quant 2025-05-07T20:33:41.4662108Z if compiled: 2025-05-07T20:33:41.4662206Z op = torch.compile(op) 2025-05-07T20:33:41.4662312Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4662386Z 2025-05-07T20:33:41.4662476Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4662480Z 2025-05-07T20:33:41.4662582Z moe/activation_test.py:117: 2025-05-07T20:33:41.4662709Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4662808Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4662909Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4663403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:41.4663499Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4663858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4664127Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4664464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4664558Z kernel = self.compile( 2025-05-07T20:33:41.4664938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4665111Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4665235Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4665239Z 2025-05-07T20:33:41.4665644Z self = 2025-05-07T20:33:41.4666562Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4667074Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfac823e0>} 2025-05-07T20:33:41.4667879Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4668069Z context = 2025-05-07T20:33:41.4668074Z 2025-05-07T20:33:41.4668238Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4668496Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4668602Z module_map=module_map) 2025-05-07T20:33:41.4668766Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4668924Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4669004Z E ^ 2025-05-07T20:33:41.4669357Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4669365Z 2025-05-07T20:33:41.4669772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4669776Z 2025-05-07T20:33:41.4669884Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4670103Z self=, 2025-05-07T20:33:41.4670187Z T=16384, 2025-05-07T20:33:41.4670263Z D=5120, 2025-05-07T20:33:41.4670344Z scale_ub=1200.0, 2025-05-07T20:33:41.4670430Z contiguous=True, 2025-05-07T20:33:41.4670516Z compiled=True, 2025-05-07T20:33:41.4670589Z ) 2025-05-07T20:33:41.4670810Z self = 2025-05-07T20:33:41.4670988Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:41.4670992Z 2025-05-07T20:33:41.4671069Z @given( 2025-05-07T20:33:41.4671193Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4671290Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4671403Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4671522Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4671634Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4671710Z ) 2025-05-07T20:33:41.4671950Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4672043Z def test_silu_mul_quant( 2025-05-07T20:33:41.4672121Z self, 2025-05-07T20:33:41.4672196Z T: int, 2025-05-07T20:33:41.4672271Z D: int, 2025-05-07T20:33:41.4672370Z scale_ub: Optional[float], 2025-05-07T20:33:41.4672460Z contiguous: bool, 2025-05-07T20:33:41.4672608Z compiled: bool, 2025-05-07T20:33:41.4672696Z ) -> None: 2025-05-07T20:33:41.4672790Z torch.manual_seed(2025) 2025-05-07T20:33:41.4672864Z 2025-05-07T20:33:41.4673035Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4673108Z 2025-05-07T20:33:41.4673203Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4673325Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4673412Z x = x_sign * x_clamp 2025-05-07T20:33:41.4673494Z x0 = x[:, :D] 2025-05-07T20:33:41.4673573Z x1 = x[:, D:] 2025-05-07T20:33:41.4673643Z 2025-05-07T20:33:41.4673727Z if contiguous: 2025-05-07T20:33:41.4673817Z x0 = x0.contiguous() 2025-05-07T20:33:41.4673903Z x1 = x1.contiguous() 2025-05-07T20:33:41.4673978Z 2025-05-07T20:33:41.4674068Z if scale_ub is not None: 2025-05-07T20:33:41.4674216Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4674357Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4674434Z ) 2025-05-07T20:33:41.4674512Z else: 2025-05-07T20:33:41.4674644Z scale_ub_tensor = None 2025-05-07T20:33:41.4674715Z 2025-05-07T20:33:41.4674845Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4674933Z op = silu_mul_quant 2025-05-07T20:33:41.4675018Z if compiled: 2025-05-07T20:33:41.4675119Z op = torch.compile(op) 2025-05-07T20:33:41.4675223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4675295Z 2025-05-07T20:33:41.4675390Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4675394Z 2025-05-07T20:33:41.4675490Z moe/activation_test.py:117: 2025-05-07T20:33:41.4675616Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4675789Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4678997Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4679456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4679558Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4680056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4680158Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4680515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4680737Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4681079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4681172Z kernel = self.compile( 2025-05-07T20:33:41.4681559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4681739Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4681867Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4681874Z 2025-05-07T20:33:41.4682084Z self = 2025-05-07T20:33:41.4682859Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4683366Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfac83a60>} 2025-05-07T20:33:41.4684111Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4684362Z context = 2025-05-07T20:33:41.4684369Z 2025-05-07T20:33:41.4684539Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4684802Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4684915Z module_map=module_map) 2025-05-07T20:33:41.4685076Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4685177Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4685261Z E ^ 2025-05-07T20:33:41.4685614Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4685618Z 2025-05-07T20:33:41.4686032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4686078Z 2025-05-07T20:33:41.4686188Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4686412Z self=, 2025-05-07T20:33:41.4686604Z T=16384, 2025-05-07T20:33:41.4686681Z D=5120, 2025-05-07T20:33:41.4686764Z scale_ub=None, 2025-05-07T20:33:41.4686854Z contiguous=False, 2025-05-07T20:33:41.4686938Z compiled=True, 2025-05-07T20:33:41.4687012Z ) 2025-05-07T20:33:41.4687232Z self = 2025-05-07T20:33:41.4687408Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:41.4687413Z 2025-05-07T20:33:41.4687493Z @given( 2025-05-07T20:33:41.4687612Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4687714Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4687832Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4687952Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4688110Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4688188Z ) 2025-05-07T20:33:41.4688432Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4688532Z def test_silu_mul_quant( 2025-05-07T20:33:41.4688609Z self, 2025-05-07T20:33:41.4688687Z T: int, 2025-05-07T20:33:41.4688769Z D: int, 2025-05-07T20:33:41.4688868Z scale_ub: Optional[float], 2025-05-07T20:33:41.4688958Z contiguous: bool, 2025-05-07T20:33:41.4689049Z compiled: bool, 2025-05-07T20:33:41.4689130Z ) -> None: 2025-05-07T20:33:41.4689227Z torch.manual_seed(2025) 2025-05-07T20:33:41.4689305Z 2025-05-07T20:33:41.4689472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4689547Z 2025-05-07T20:33:41.4689645Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4689772Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4689862Z x = x_sign * x_clamp 2025-05-07T20:33:41.4689954Z x0 = x[:, :D] 2025-05-07T20:33:41.4690035Z x1 = x[:, D:] 2025-05-07T20:33:41.4690113Z 2025-05-07T20:33:41.4690198Z if contiguous: 2025-05-07T20:33:41.4690289Z x0 = x0.contiguous() 2025-05-07T20:33:41.4690383Z x1 = x1.contiguous() 2025-05-07T20:33:41.4690456Z 2025-05-07T20:33:41.4690546Z if scale_ub is not None: 2025-05-07T20:33:41.4690655Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4690788Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4690864Z ) 2025-05-07T20:33:41.4690943Z else: 2025-05-07T20:33:41.4691039Z scale_ub_tensor = None 2025-05-07T20:33:41.4691115Z 2025-05-07T20:33:41.4691247Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4691339Z op = silu_mul_quant 2025-05-07T20:33:41.4691427Z if compiled: 2025-05-07T20:33:41.4691527Z op = torch.compile(op) 2025-05-07T20:33:41.4691705Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4691782Z 2025-05-07T20:33:41.4691877Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4691882Z 2025-05-07T20:33:41.4691978Z moe/activation_test.py:117: 2025-05-07T20:33:41.4692110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4692211Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4692309Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4692678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4692773Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4693264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4693361Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4693760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4693989Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4694366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4694460Z kernel = self.compile( 2025-05-07T20:33:41.4694842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4695015Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4695149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4695153Z 2025-05-07T20:33:41.4695359Z self = 2025-05-07T20:33:41.4696176Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4696689Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfbb88cc0>} 2025-05-07T20:33:41.4697435Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4697629Z context = 2025-05-07T20:33:41.4697633Z 2025-05-07T20:33:41.4697798Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4698061Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4698171Z module_map=module_map) 2025-05-07T20:33:41.4698336Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4698439Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4698518Z E ^ 2025-05-07T20:33:41.4698874Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4698879Z 2025-05-07T20:33:41.4699292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4699296Z 2025-05-07T20:33:41.4699400Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4699641Z self=, 2025-05-07T20:33:41.4699731Z T=2048, 2025-05-07T20:33:41.4699822Z D=5120, 2025-05-07T20:33:41.4699922Z scale_ub=None, 2025-05-07T20:33:41.4700009Z contiguous=False, 2025-05-07T20:33:41.4700092Z compiled=True, 2025-05-07T20:33:41.4700170Z ) 2025-05-07T20:33:41.4700390Z self = 2025-05-07T20:33:41.4700610Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:41.4700621Z 2025-05-07T20:33:41.4700699Z @given( 2025-05-07T20:33:41.4700816Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4700922Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4701035Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4701152Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4701269Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4701345Z ) 2025-05-07T20:33:41.4701584Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4701681Z def test_silu_mul_quant( 2025-05-07T20:33:41.4701756Z self, 2025-05-07T20:33:41.4701832Z T: int, 2025-05-07T20:33:41.4701912Z D: int, 2025-05-07T20:33:41.4702051Z scale_ub: Optional[float], 2025-05-07T20:33:41.4702149Z contiguous: bool, 2025-05-07T20:33:41.4702234Z compiled: bool, 2025-05-07T20:33:41.4702311Z ) -> None: 2025-05-07T20:33:41.4702460Z torch.manual_seed(2025) 2025-05-07T20:33:41.4702533Z 2025-05-07T20:33:41.4702699Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4702774Z 2025-05-07T20:33:41.4702863Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4702986Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4703078Z x = x_sign * x_clamp 2025-05-07T20:33:41.4703157Z x0 = x[:, :D] 2025-05-07T20:33:41.4703237Z x1 = x[:, D:] 2025-05-07T20:33:41.4703311Z 2025-05-07T20:33:41.4703395Z if contiguous: 2025-05-07T20:33:41.4703489Z x0 = x0.contiguous() 2025-05-07T20:33:41.4703577Z x1 = x1.contiguous() 2025-05-07T20:33:41.4703651Z 2025-05-07T20:33:41.4703747Z if scale_ub is not None: 2025-05-07T20:33:41.4703852Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4704030Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4704108Z ) 2025-05-07T20:33:41.4704187Z else: 2025-05-07T20:33:41.4704280Z scale_ub_tensor = None 2025-05-07T20:33:41.4704354Z 2025-05-07T20:33:41.4704481Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4704570Z op = silu_mul_quant 2025-05-07T20:33:41.4704657Z if compiled: 2025-05-07T20:33:41.4704756Z op = torch.compile(op) 2025-05-07T20:33:41.4704864Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4704935Z 2025-05-07T20:33:41.4705025Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4705029Z 2025-05-07T20:33:41.4705127Z moe/activation_test.py:117: 2025-05-07T20:33:41.4705254Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4705357Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4705462Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4705824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4705919Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4706411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4706507Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4706864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4707083Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4707418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4707514Z kernel = self.compile( 2025-05-07T20:33:41.4707894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4708113Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4708244Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4708248Z 2025-05-07T20:33:41.4708451Z self = 2025-05-07T20:33:41.4709226Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4709727Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfbb89a80>} 2025-05-07T20:33:41.4710564Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4710759Z context = 2025-05-07T20:33:41.4710802Z 2025-05-07T20:33:41.4710969Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4711233Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4711342Z module_map=module_map) 2025-05-07T20:33:41.4711503Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4711603Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4711680Z E ^ 2025-05-07T20:33:41.4712034Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4712038Z 2025-05-07T20:33:41.4712453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4712497Z 2025-05-07T20:33:41.4712604Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4712825Z self=, 2025-05-07T20:33:41.4712906Z T=2048, 2025-05-07T20:33:41.4712988Z D=5120, 2025-05-07T20:33:41.4713072Z scale_ub=1200.0, 2025-05-07T20:33:41.4713158Z contiguous=False, 2025-05-07T20:33:41.4713245Z compiled=True, 2025-05-07T20:33:41.4713319Z ) 2025-05-07T20:33:41.4713536Z self = 2025-05-07T20:33:41.4713713Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:41.4713718Z 2025-05-07T20:33:41.4713797Z @given( 2025-05-07T20:33:41.4713917Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4714015Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4714131Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4714256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4714369Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4714446Z ) 2025-05-07T20:33:41.4714689Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4714781Z def test_silu_mul_quant( 2025-05-07T20:33:41.4714860Z self, 2025-05-07T20:33:41.4714941Z T: int, 2025-05-07T20:33:41.4715018Z D: int, 2025-05-07T20:33:41.4715117Z scale_ub: Optional[float], 2025-05-07T20:33:41.4715209Z contiguous: bool, 2025-05-07T20:33:41.4715295Z compiled: bool, 2025-05-07T20:33:41.4715377Z ) -> None: 2025-05-07T20:33:41.4715471Z torch.manual_seed(2025) 2025-05-07T20:33:41.4715545Z 2025-05-07T20:33:41.4715776Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4715852Z 2025-05-07T20:33:41.4715947Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4716074Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4716214Z x = x_sign * x_clamp 2025-05-07T20:33:41.4716295Z x0 = x[:, :D] 2025-05-07T20:33:41.4716379Z x1 = x[:, D:] 2025-05-07T20:33:41.4716453Z 2025-05-07T20:33:41.4716539Z if contiguous: 2025-05-07T20:33:41.4716636Z x0 = x0.contiguous() 2025-05-07T20:33:41.4716723Z x1 = x1.contiguous() 2025-05-07T20:33:41.4716799Z 2025-05-07T20:33:41.4716890Z if scale_ub is not None: 2025-05-07T20:33:41.4716994Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4717129Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4717205Z ) 2025-05-07T20:33:41.4717281Z else: 2025-05-07T20:33:41.4717379Z scale_ub_tensor = None 2025-05-07T20:33:41.4717455Z 2025-05-07T20:33:41.4717582Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4717721Z op = silu_mul_quant 2025-05-07T20:33:41.4717811Z if compiled: 2025-05-07T20:33:41.4717912Z op = torch.compile(op) 2025-05-07T20:33:41.4718019Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4718134Z 2025-05-07T20:33:41.4718226Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4718233Z 2025-05-07T20:33:41.4718330Z moe/activation_test.py:117: 2025-05-07T20:33:41.4718458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4718561Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4718660Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4719022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4719118Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4719604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4719706Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4720124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4720349Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4720688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4720781Z kernel = self.compile( 2025-05-07T20:33:41.4721158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4721332Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4721459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4721464Z 2025-05-07T20:33:41.4721670Z self = 2025-05-07T20:33:41.4722448Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4722953Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfbb8ac00>} 2025-05-07T20:33:41.4723694Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4723883Z context = 2025-05-07T20:33:41.4723887Z 2025-05-07T20:33:41.4724050Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4724313Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4724420Z module_map=module_map) 2025-05-07T20:33:41.4724629Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4724731Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4724812Z E ^ 2025-05-07T20:33:41.4725164Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4725168Z 2025-05-07T20:33:41.4725575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4725579Z 2025-05-07T20:33:41.4725684Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4725906Z self=, 2025-05-07T20:33:41.4725989Z T=4096, 2025-05-07T20:33:41.4726066Z D=5120, 2025-05-07T20:33:41.4726149Z scale_ub=1200.0, 2025-05-07T20:33:41.4726238Z contiguous=True, 2025-05-07T20:33:41.4726360Z compiled=True, 2025-05-07T20:33:41.4726438Z ) 2025-05-07T20:33:41.4726661Z self = 2025-05-07T20:33:41.4726831Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:41.4726875Z 2025-05-07T20:33:41.4726955Z @given( 2025-05-07T20:33:41.4727075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4727173Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4727290Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4727406Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4727520Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4727599Z ) 2025-05-07T20:33:41.4727840Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4727935Z def test_silu_mul_quant( 2025-05-07T20:33:41.4728020Z self, 2025-05-07T20:33:41.4728099Z T: int, 2025-05-07T20:33:41.4728180Z D: int, 2025-05-07T20:33:41.4728325Z scale_ub: Optional[float], 2025-05-07T20:33:41.4728415Z contiguous: bool, 2025-05-07T20:33:41.4728500Z compiled: bool, 2025-05-07T20:33:41.4728591Z ) -> None: 2025-05-07T20:33:41.4728686Z torch.manual_seed(2025) 2025-05-07T20:33:41.4728766Z 2025-05-07T20:33:41.4728932Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4729007Z 2025-05-07T20:33:41.4729103Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4729228Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4729317Z x = x_sign * x_clamp 2025-05-07T20:33:41.4729406Z x0 = x[:, :D] 2025-05-07T20:33:41.4729486Z x1 = x[:, D:] 2025-05-07T20:33:41.4729559Z 2025-05-07T20:33:41.4729646Z if contiguous: 2025-05-07T20:33:41.4729737Z x0 = x0.contiguous() 2025-05-07T20:33:41.4729830Z x1 = x1.contiguous() 2025-05-07T20:33:41.4729909Z 2025-05-07T20:33:41.4729999Z if scale_ub is not None: 2025-05-07T20:33:41.4730110Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4730247Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4730329Z ) 2025-05-07T20:33:41.4730408Z else: 2025-05-07T20:33:41.4730502Z scale_ub_tensor = None 2025-05-07T20:33:41.4730576Z 2025-05-07T20:33:41.4730708Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4730798Z op = silu_mul_quant 2025-05-07T20:33:41.4730883Z if compiled: 2025-05-07T20:33:41.4730984Z op = torch.compile(op) 2025-05-07T20:33:41.4731089Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4731161Z 2025-05-07T20:33:41.4731260Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4731265Z 2025-05-07T20:33:41.4731360Z moe/activation_test.py:117: 2025-05-07T20:33:41.4731496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4731646Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4731745Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4732108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4732203Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4732691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4732789Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4733146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4733370Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4733705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4733845Z kernel = self.compile( 2025-05-07T20:33:41.4734230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4734402Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4734569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4734577Z 2025-05-07T20:33:41.4734784Z self = 2025-05-07T20:33:41.4735555Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4736060Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa828220>} 2025-05-07T20:33:41.4736842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4737042Z context = 2025-05-07T20:33:41.4737047Z 2025-05-07T20:33:41.4737208Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4737467Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4737578Z module_map=module_map) 2025-05-07T20:33:41.4737737Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4737838Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4737916Z E ^ 2025-05-07T20:33:41.4738265Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4738270Z 2025-05-07T20:33:41.4738687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4738694Z 2025-05-07T20:33:41.4738795Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4739019Z self=, 2025-05-07T20:33:41.4739098Z T=128, 2025-05-07T20:33:41.4739175Z D=5120, 2025-05-07T20:33:41.4739259Z scale_ub=1200.0, 2025-05-07T20:33:41.4739345Z contiguous=False, 2025-05-07T20:33:41.4739427Z compiled=True, 2025-05-07T20:33:41.4739504Z ) 2025-05-07T20:33:41.4739720Z self = 2025-05-07T20:33:41.4739892Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:41.4739896Z 2025-05-07T20:33:41.4739976Z @given( 2025-05-07T20:33:41.4740096Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4740196Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4740316Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4740481Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4740596Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4740674Z ) 2025-05-07T20:33:41.4740915Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4741013Z def test_silu_mul_quant( 2025-05-07T20:33:41.4741093Z self, 2025-05-07T20:33:41.4741170Z T: int, 2025-05-07T20:33:41.4741251Z D: int, 2025-05-07T20:33:41.4741350Z scale_ub: Optional[float], 2025-05-07T20:33:41.4741439Z contiguous: bool, 2025-05-07T20:33:41.4741530Z compiled: bool, 2025-05-07T20:33:41.4741608Z ) -> None: 2025-05-07T20:33:41.4741706Z torch.manual_seed(2025) 2025-05-07T20:33:41.4741780Z 2025-05-07T20:33:41.4741946Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4742072Z 2025-05-07T20:33:41.4742166Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4742294Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4742386Z x = x_sign * x_clamp 2025-05-07T20:33:41.4742505Z x0 = x[:, :D] 2025-05-07T20:33:41.4742586Z x1 = x[:, D:] 2025-05-07T20:33:41.4742662Z 2025-05-07T20:33:41.4742745Z if contiguous: 2025-05-07T20:33:41.4742837Z x0 = x0.contiguous() 2025-05-07T20:33:41.4742931Z x1 = x1.contiguous() 2025-05-07T20:33:41.4743003Z 2025-05-07T20:33:41.4743094Z if scale_ub is not None: 2025-05-07T20:33:41.4743203Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4743336Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4743414Z ) 2025-05-07T20:33:41.4743495Z else: 2025-05-07T20:33:41.4743589Z scale_ub_tensor = None 2025-05-07T20:33:41.4743664Z 2025-05-07T20:33:41.4743795Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4743929Z op = silu_mul_quant 2025-05-07T20:33:41.4744022Z if compiled: 2025-05-07T20:33:41.4744122Z op = torch.compile(op) 2025-05-07T20:33:41.4744228Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4744305Z 2025-05-07T20:33:41.4744397Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4744401Z 2025-05-07T20:33:41.4744504Z moe/activation_test.py:117: 2025-05-07T20:33:41.4744631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4744732Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4744835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4745199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4745291Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4745783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4745885Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4746243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4746465Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4746799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4746897Z kernel = self.compile( 2025-05-07T20:33:41.4747278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4747450Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4747579Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4747583Z 2025-05-07T20:33:41.4747789Z self = 2025-05-07T20:33:41.4748614Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4749118Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa828f40>} 2025-05-07T20:33:41.4749862Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4750053Z context = 2025-05-07T20:33:41.4750057Z 2025-05-07T20:33:41.4750219Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4750524Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4750638Z module_map=module_map) 2025-05-07T20:33:41.4750804Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4750967Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4751046Z E ^ 2025-05-07T20:33:41.4751400Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4751405Z 2025-05-07T20:33:41.4751812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4751816Z 2025-05-07T20:33:41.4751919Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4752142Z self=, 2025-05-07T20:33:41.4752222Z T=16384, 2025-05-07T20:33:41.4752302Z D=7168, 2025-05-07T20:33:41.4752386Z scale_ub=1200.0, 2025-05-07T20:33:41.4752474Z contiguous=True, 2025-05-07T20:33:41.4752603Z compiled=True, 2025-05-07T20:33:41.4752677Z ) 2025-05-07T20:33:41.4752893Z self = 2025-05-07T20:33:41.4753074Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:41.4753079Z 2025-05-07T20:33:41.4753157Z @given( 2025-05-07T20:33:41.4753275Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4753379Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4753495Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4753615Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4753727Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4753806Z ) 2025-05-07T20:33:41.4754050Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4754145Z def test_silu_mul_quant( 2025-05-07T20:33:41.4754224Z self, 2025-05-07T20:33:41.4754305Z T: int, 2025-05-07T20:33:41.4754385Z D: int, 2025-05-07T20:33:41.4754483Z scale_ub: Optional[float], 2025-05-07T20:33:41.4754578Z contiguous: bool, 2025-05-07T20:33:41.4754663Z compiled: bool, 2025-05-07T20:33:41.4754742Z ) -> None: 2025-05-07T20:33:41.4754839Z torch.manual_seed(2025) 2025-05-07T20:33:41.4754912Z 2025-05-07T20:33:41.4755081Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4755156Z 2025-05-07T20:33:41.4755249Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4755377Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4755467Z x = x_sign * x_clamp 2025-05-07T20:33:41.4755547Z x0 = x[:, :D] 2025-05-07T20:33:41.4755631Z x1 = x[:, D:] 2025-05-07T20:33:41.4755750Z 2025-05-07T20:33:41.4755837Z if contiguous: 2025-05-07T20:33:41.4755933Z x0 = x0.contiguous() 2025-05-07T20:33:41.4756025Z x1 = x1.contiguous() 2025-05-07T20:33:41.4756146Z 2025-05-07T20:33:41.4756242Z if scale_ub is not None: 2025-05-07T20:33:41.4756347Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4756488Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4756565Z ) 2025-05-07T20:33:41.4756642Z else: 2025-05-07T20:33:41.4756739Z scale_ub_tensor = None 2025-05-07T20:33:41.4756812Z 2025-05-07T20:33:41.4756940Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4757035Z op = silu_mul_quant 2025-05-07T20:33:41.4757120Z if compiled: 2025-05-07T20:33:41.4757218Z op = torch.compile(op) 2025-05-07T20:33:41.4757325Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4757398Z 2025-05-07T20:33:41.4757490Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4757494Z 2025-05-07T20:33:41.4757597Z moe/activation_test.py:117: 2025-05-07T20:33:41.4757772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4757884Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4757982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4758384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4758481Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4758966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4759062Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4759418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4759636Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4759976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4760111Z kernel = self.compile( 2025-05-07T20:33:41.4760489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4760669Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4760795Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4760800Z 2025-05-07T20:33:41.4761006Z self = 2025-05-07T20:33:41.4761780Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4762285Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa82a160>} 2025-05-07T20:33:41.4763031Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4763222Z context = 2025-05-07T20:33:41.4763226Z 2025-05-07T20:33:41.4763390Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4763650Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4763759Z module_map=module_map) 2025-05-07T20:33:41.4763923Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4764021Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4764103Z E ^ 2025-05-07T20:33:41.4764457Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4764502Z 2025-05-07T20:33:41.4764912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4764918Z 2025-05-07T20:33:41.4765023Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4765244Z self=, 2025-05-07T20:33:41.4765323Z T=16384, 2025-05-07T20:33:41.4765626Z D=5120, 2025-05-07T20:33:41.4765752Z scale_ub=1200.0, 2025-05-07T20:33:41.4765860Z contiguous=True, 2025-05-07T20:33:41.4765946Z compiled=False, 2025-05-07T20:33:41.4766017Z ) 2025-05-07T20:33:41.4766236Z self = 2025-05-07T20:33:41.4766411Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:41.4766415Z 2025-05-07T20:33:41.4766490Z @given( 2025-05-07T20:33:41.4766719Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4766821Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4766937Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4767055Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4767231Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4767310Z ) 2025-05-07T20:33:41.4767552Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4767644Z def test_silu_mul_quant( 2025-05-07T20:33:41.4767722Z self, 2025-05-07T20:33:41.4767799Z T: int, 2025-05-07T20:33:41.4767873Z D: int, 2025-05-07T20:33:41.4767972Z scale_ub: Optional[float], 2025-05-07T20:33:41.4768059Z contiguous: bool, 2025-05-07T20:33:41.4768144Z compiled: bool, 2025-05-07T20:33:41.4768225Z ) -> None: 2025-05-07T20:33:41.4768318Z torch.manual_seed(2025) 2025-05-07T20:33:41.4768390Z 2025-05-07T20:33:41.4768561Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4768635Z 2025-05-07T20:33:41.4768789Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4768915Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4769005Z x = x_sign * x_clamp 2025-05-07T20:33:41.4769087Z x0 = x[:, :D] 2025-05-07T20:33:41.4769167Z x1 = x[:, D:] 2025-05-07T20:33:41.4769237Z 2025-05-07T20:33:41.4769322Z if contiguous: 2025-05-07T20:33:41.4769411Z x0 = x0.contiguous() 2025-05-07T20:33:41.4769498Z x1 = x1.contiguous() 2025-05-07T20:33:41.4769571Z 2025-05-07T20:33:41.4769660Z if scale_ub is not None: 2025-05-07T20:33:41.4769764Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4769898Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4769972Z ) 2025-05-07T20:33:41.4770046Z else: 2025-05-07T20:33:41.4770141Z scale_ub_tensor = None 2025-05-07T20:33:41.4770217Z 2025-05-07T20:33:41.4770351Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4770439Z op = silu_mul_quant 2025-05-07T20:33:41.4770523Z if compiled: 2025-05-07T20:33:41.4770625Z op = torch.compile(op) 2025-05-07T20:33:41.4770729Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4770800Z 2025-05-07T20:33:41.4770891Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4770895Z 2025-05-07T20:33:41.4770989Z moe/activation_test.py:117: 2025-05-07T20:33:41.4771117Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4771218Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4771315Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4771811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
Hypothesis keeps drawing examples, and each retry re-prints the identical test source and the identical traceback; only the drawn parameters and the resulting error change (with compiled=True the sole extra frame is torch/_dynamo/eval_frame.py:678: in _fn). The retries condense to:

Trying example: test_silu_mul_quant(T=1,    D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=128,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> CompilationError (fp8e4nv not supported)

The next example is the first to fail differently, before ever reaching the kernel:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
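Two things stand out in these OutOfMemoryError reports: the process holds roughly 21.9-22.0 GiB of a 22.07 GiB device throughout, while the failed requests shrink to as little as 40-56 MiB, which suggests tensors from earlier failed examples are never released and each retry starts with less headroom. Beyond the allocator hint printed in the message itself (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True), a common mitigation is to release cached memory between examples. A sketch using only standard PyTorch calls (the helper name is hypothetical):

    import gc
    import os

    import torch

    # From the error message: let the caching allocator grow segments instead of
    # fragmenting fixed-size ones. Must be set before the first CUDA allocation.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


    def release_cuda_memory() -> None:
        """Call between Hypothesis examples (e.g. in setUp/tearDown) so one
        failing example does not starve the next."""
        gc.collect()              # drop unreachable Python references first
        torch.cuda.empty_cache()  # return cached, unused blocks to the driver
        torch.cuda.synchronize()  # ensure pending frees have completed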
Subsequent examples alternate between the same two failures:

Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> OutOfMemoryError (112.00 MiB at moe/activation_test.py:95)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=False) -> OutOfMemoryError (448.00 MiB at moe/activation_test.py:92, the torch.randn allocation)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> OutOfMemoryError (56.00 MiB at moe/activation_test.py:95)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=False) -> OutOfMemoryError (56.00 MiB at moe/activation_test.py:94, x_sign = torch.sign(x))
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError (fp8e4nv not supported)
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4867917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4868138Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4868479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4868582Z kernel = self.compile( 2025-05-07T20:33:41.4869023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4869204Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4869336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4869340Z 2025-05-07T20:33:41.4869545Z self = 2025-05-07T20:33:41.4870327Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4870829Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa7f65c0>} 2025-05-07T20:33:41.4871581Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4871776Z context = 2025-05-07T20:33:41.4871781Z 2025-05-07T20:33:41.4871943Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4872209Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4872318Z module_map=module_map) 2025-05-07T20:33:41.4872482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4872581Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4872658Z E ^ 2025-05-07T20:33:41.4873016Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4873020Z 2025-05-07T20:33:41.4873435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4873498Z 2025-05-07T20:33:41.4873607Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4873832Z self=, 2025-05-07T20:33:41.4873911Z T=128, 2025-05-07T20:33:41.4873991Z D=5120, 2025-05-07T20:33:41.4874074Z scale_ub=None, 2025-05-07T20:33:41.4874160Z contiguous=True, 2025-05-07T20:33:41.4874249Z compiled=False, 2025-05-07T20:33:41.4874323Z ) 2025-05-07T20:33:41.4874539Z self = 2025-05-07T20:33:41.4874710Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:41.4874715Z 2025-05-07T20:33:41.4874792Z @given( 2025-05-07T20:33:41.4874914Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4875058Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4875175Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4875301Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4875414Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4875553Z ) 2025-05-07T20:33:41.4875860Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4875954Z def test_silu_mul_quant( 2025-05-07T20:33:41.4876030Z self, 2025-05-07T20:33:41.4876112Z T: int, 2025-05-07T20:33:41.4876189Z D: int, 2025-05-07T20:33:41.4876292Z scale_ub: Optional[float], 2025-05-07T20:33:41.4876381Z contiguous: bool, 2025-05-07T20:33:41.4876467Z compiled: bool, 2025-05-07T20:33:41.4876548Z ) -> None: 2025-05-07T20:33:41.4876642Z torch.manual_seed(2025) 2025-05-07T20:33:41.4876720Z 2025-05-07T20:33:41.4876889Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4876969Z 2025-05-07T20:33:41.4877062Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4877235Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4877326Z x = x_sign * x_clamp 2025-05-07T20:33:41.4877410Z x0 = x[:, :D] 2025-05-07T20:33:41.4877497Z x1 = x[:, D:] 2025-05-07T20:33:41.4877569Z 2025-05-07T20:33:41.4877653Z if contiguous: 2025-05-07T20:33:41.4877748Z x0 = x0.contiguous() 2025-05-07T20:33:41.4877837Z x1 = x1.contiguous() 2025-05-07T20:33:41.4877912Z 2025-05-07T20:33:41.4878002Z if scale_ub is not None: 2025-05-07T20:33:41.4878108Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4878243Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4878321Z ) 2025-05-07T20:33:41.4878397Z else: 2025-05-07T20:33:41.4878494Z scale_ub_tensor = None 2025-05-07T20:33:41.4878569Z 2025-05-07T20:33:41.4878701Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4878802Z op = silu_mul_quant 2025-05-07T20:33:41.4878889Z if compiled: 2025-05-07T20:33:41.4878988Z op = torch.compile(op) 2025-05-07T20:33:41.4879099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4879173Z 2025-05-07T20:33:41.4879270Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4879274Z 2025-05-07T20:33:41.4879375Z moe/activation_test.py:117: 2025-05-07T20:33:41.4879502Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4879608Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4879707Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4880201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4880303Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4880661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4880932Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4881270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4881365Z kernel = self.compile( 2025-05-07T20:33:41.4881747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4881921Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4882048Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4882057Z 2025-05-07T20:33:41.4882259Z self = 2025-05-07T20:33:41.4883081Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4883589Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa7f74c0>} 2025-05-07T20:33:41.4884374Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4884565Z context = 2025-05-07T20:33:41.4884570Z 2025-05-07T20:33:41.4884733Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4884994Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4885104Z module_map=module_map) 2025-05-07T20:33:41.4885267Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4885409Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4885490Z E ^ 2025-05-07T20:33:41.4885841Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4885848Z 2025-05-07T20:33:41.4886262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4886267Z 2025-05-07T20:33:41.4886369Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4886589Z self=, 2025-05-07T20:33:41.4886671Z T=128, 2025-05-07T20:33:41.4886747Z D=7168, 2025-05-07T20:33:41.4886831Z scale_ub=None, 2025-05-07T20:33:41.4886916Z contiguous=True, 2025-05-07T20:33:41.4887000Z compiled=False, 2025-05-07T20:33:41.4887080Z ) 2025-05-07T20:33:41.4887299Z self = 2025-05-07T20:33:41.4887471Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:41.4887475Z 2025-05-07T20:33:41.4887558Z @given( 2025-05-07T20:33:41.4887678Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4887778Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4887895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4888013Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4888127Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4888202Z ) 2025-05-07T20:33:41.4888443Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4888538Z def test_silu_mul_quant( 2025-05-07T20:33:41.4888615Z self, 2025-05-07T20:33:41.4888692Z T: int, 2025-05-07T20:33:41.4888772Z D: int, 2025-05-07T20:33:41.4888872Z scale_ub: Optional[float], 2025-05-07T20:33:41.4888964Z contiguous: bool, 2025-05-07T20:33:41.4889102Z compiled: bool, 2025-05-07T20:33:41.4889183Z ) -> None: 2025-05-07T20:33:41.4889279Z torch.manual_seed(2025) 2025-05-07T20:33:41.4889355Z 2025-05-07T20:33:41.4889532Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4889610Z 2025-05-07T20:33:41.4889702Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4889827Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4889923Z x = x_sign * x_clamp 2025-05-07T20:33:41.4890006Z x0 = x[:, :D] 2025-05-07T20:33:41.4890086Z x1 = x[:, D:] 2025-05-07T20:33:41.4890161Z 2025-05-07T20:33:41.4890246Z if contiguous: 2025-05-07T20:33:41.4890336Z x0 = x0.contiguous() 2025-05-07T20:33:41.4890430Z x1 = x1.contiguous() 2025-05-07T20:33:41.4890502Z 2025-05-07T20:33:41.4890593Z if scale_ub is not None: 2025-05-07T20:33:41.4890747Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4890884Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4890965Z ) 2025-05-07T20:33:41.4891042Z else: 2025-05-07T20:33:41.4891177Z scale_ub_tensor = None 2025-05-07T20:33:41.4891253Z 2025-05-07T20:33:41.4891382Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4891472Z op = silu_mul_quant 2025-05-07T20:33:41.4891560Z if compiled: 2025-05-07T20:33:41.4891658Z op = torch.compile(op) 2025-05-07T20:33:41.4891762Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4891838Z 2025-05-07T20:33:41.4891928Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4891933Z 2025-05-07T20:33:41.4892030Z moe/activation_test.py:117: 2025-05-07T20:33:41.4892161Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4892260Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4892369Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4892903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4893005Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4893362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4893581Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4893917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4894015Z kernel = self.compile( 2025-05-07T20:33:41.4894392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4894568Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4894699Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4894706Z 2025-05-07T20:33:41.4894911Z self = 2025-05-07T20:33:41.4895688Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4896190Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa168540>} 2025-05-07T20:33:41.4896934Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4897124Z context = 2025-05-07T20:33:41.4897129Z 2025-05-07T20:33:41.4897298Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4897604Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4897713Z module_map=module_map) 2025-05-07T20:33:41.4897876Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4897975Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4898053Z E ^ 2025-05-07T20:33:41.4898410Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4898414Z 2025-05-07T20:33:41.4898822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4898826Z 2025-05-07T20:33:41.4898932Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4899154Z self=, 2025-05-07T20:33:41.4899273Z T=2048, 2025-05-07T20:33:41.4899357Z D=7168, 2025-05-07T20:33:41.4899444Z scale_ub=1200.0, 2025-05-07T20:33:41.4899529Z contiguous=True, 2025-05-07T20:33:41.4899658Z compiled=False, 2025-05-07T20:33:41.4899732Z ) 2025-05-07T20:33:41.4899948Z self = 2025-05-07T20:33:41.4900126Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:41.4900130Z 2025-05-07T20:33:41.4900208Z @given( 2025-05-07T20:33:41.4900331Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4900429Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4900544Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4900664Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4900778Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4900853Z ) 2025-05-07T20:33:41.4901104Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4901238Z def test_silu_mul_quant( 2025-05-07T20:33:41.4901322Z self, 2025-05-07T20:33:41.4901402Z T: int, 2025-05-07T20:33:41.4901483Z D: int, 2025-05-07T20:33:41.4901585Z scale_ub: Optional[float], 2025-05-07T20:33:41.4901674Z contiguous: bool, 2025-05-07T20:33:41.4901759Z compiled: bool, 2025-05-07T20:33:41.4901841Z ) -> None: 2025-05-07T20:33:41.4901935Z torch.manual_seed(2025) 2025-05-07T20:33:41.4902009Z 2025-05-07T20:33:41.4902182Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4903980Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4903990Z 2025-05-07T20:33:41.4904110Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4904114Z 2025-05-07T20:33:41.4904217Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4904440Z self=, 2025-05-07T20:33:41.4904519Z T=1, 2025-05-07T20:33:41.4904595Z D=5120, 2025-05-07T20:33:41.4904686Z scale_ub=1200.0, 2025-05-07T20:33:41.4904772Z contiguous=True, 2025-05-07T20:33:41.4904855Z compiled=False, 2025-05-07T20:33:41.4904932Z ) 2025-05-07T20:33:41.4905147Z self = 2025-05-07T20:33:41.4905310Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:41.4905317Z 2025-05-07T20:33:41.4905441Z @given( 2025-05-07T20:33:41.4905562Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4905660Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4905784Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4905898Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4906014Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4906088Z ) 2025-05-07T20:33:41.4906328Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4906423Z def test_silu_mul_quant( 2025-05-07T20:33:41.4906499Z self, 2025-05-07T20:33:41.4906575Z T: int, 2025-05-07T20:33:41.4906658Z D: int, 2025-05-07T20:33:41.4906758Z scale_ub: Optional[float], 2025-05-07T20:33:41.4906845Z contiguous: bool, 2025-05-07T20:33:41.4906933Z compiled: bool, 2025-05-07T20:33:41.4907012Z ) -> None: 2025-05-07T20:33:41.4907175Z torch.manual_seed(2025) 2025-05-07T20:33:41.4907255Z 2025-05-07T20:33:41.4907422Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4907537Z 2025-05-07T20:33:41.4907629Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4907754Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4907842Z x = x_sign * x_clamp 2025-05-07T20:33:41.4907923Z x0 = x[:, :D] 2025-05-07T20:33:41.4908002Z x1 = x[:, D:] 2025-05-07T20:33:41.4908078Z 2025-05-07T20:33:41.4908161Z if contiguous: 2025-05-07T20:33:41.4908255Z x0 = x0.contiguous() 2025-05-07T20:33:41.4908346Z x1 = x1.contiguous() 2025-05-07T20:33:41.4908419Z 2025-05-07T20:33:41.4908509Z if scale_ub is not None: 2025-05-07T20:33:41.4908617Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4908749Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4908831Z ) 2025-05-07T20:33:41.4908909Z else: 2025-05-07T20:33:41.4909046Z scale_ub_tensor = None 2025-05-07T20:33:41.4909130Z 2025-05-07T20:33:41.4909261Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4909355Z op = silu_mul_quant 2025-05-07T20:33:41.4909448Z if compiled: 2025-05-07T20:33:41.4909545Z op = torch.compile(op) 2025-05-07T20:33:41.4909651Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4909727Z 2025-05-07T20:33:41.4909818Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4909822Z 2025-05-07T20:33:41.4909918Z moe/activation_test.py:117: 2025-05-07T20:33:41.4910049Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4910148Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4910250Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4910747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4910849Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4911210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4911433Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4911775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4911868Z kernel = self.compile( 2025-05-07T20:33:41.4912245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4912422Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4912547Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4912551Z 2025-05-07T20:33:41.4912759Z self = 2025-05-07T20:33:41.4913581Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4914084Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa169b20>} 2025-05-07T20:33:41.4914826Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4915016Z context = 2025-05-07T20:33:41.4915020Z 2025-05-07T20:33:41.4915187Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4915488Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4915600Z module_map=module_map) 2025-05-07T20:33:41.4915809Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4915951Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4916029Z E ^ 2025-05-07T20:33:41.4916385Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4916389Z 2025-05-07T20:33:41.4916796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4916801Z 2025-05-07T20:33:41.4916906Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4917128Z self=, 2025-05-07T20:33:41.4917205Z T=2048, 2025-05-07T20:33:41.4917285Z D=5120, 2025-05-07T20:33:41.4917368Z scale_ub=None, 2025-05-07T20:33:41.4917455Z contiguous=True, 2025-05-07T20:33:41.4917586Z compiled=False, 2025-05-07T20:33:41.4917660Z ) 2025-05-07T20:33:41.4917880Z self = 2025-05-07T20:33:41.4918055Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:41.4918059Z 2025-05-07T20:33:41.4918138Z @given( 2025-05-07T20:33:41.4918258Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4918357Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4918471Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4918589Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4918702Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4918776Z ) 2025-05-07T20:33:41.4919020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4919114Z def test_silu_mul_quant( 2025-05-07T20:33:41.4919198Z self, 2025-05-07T20:33:41.4919278Z T: int, 2025-05-07T20:33:41.4919358Z D: int, 2025-05-07T20:33:41.4919460Z scale_ub: Optional[float], 2025-05-07T20:33:41.4919555Z contiguous: bool, 2025-05-07T20:33:41.4919639Z compiled: bool, 2025-05-07T20:33:41.4919722Z ) -> None: 2025-05-07T20:33:41.4919815Z torch.manual_seed(2025) 2025-05-07T20:33:41.4919889Z 2025-05-07T20:33:41.4920057Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4920131Z 2025-05-07T20:33:41.4920223Z > x_sign = torch.sign(x) 2025-05-07T20:33:41.4922022Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
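[NOTE] The second recurring failure mode is the triton.compiler.errors.CompilationError above: Triton rejects the fp8e4nv (e4m3) element type because this GPU only exposes fp8e4b15 and fp8e5, which typically indicates a pre-SM-8.9 NVIDIA part (fp8e4nv generally requires compute capability 8.9 or newer). A hedged sketch of a capability gate; the helper name, the >= (8, 9) threshold, and the stand-in test class are assumptions, not code from activation_test.py:

    # sketch: skip fp8e4nv (e4m3) kernels on GPUs older than SM 8.9
    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # assumption: Triton's fp8e4nv needs compute capability >= (8, 9)
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    class TestSiluMulQuant(unittest.TestCase):  # hypothetical stand-in class name
        @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs compute capability >= 8.9")
        def test_silu_mul_quant(self) -> None:
            ...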
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4922074Z 2025-05-07T20:33:41.4922192Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:41.4922201Z 2025-05-07T20:33:41.4922304Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4922526Z self=, 2025-05-07T20:33:41.4922612Z T=16384, 2025-05-07T20:33:41.4922688Z D=5120, 2025-05-07T20:33:41.4922770Z scale_ub=None, 2025-05-07T20:33:41.4922856Z contiguous=True, 2025-05-07T20:33:41.4922939Z compiled=False, 2025-05-07T20:33:41.4923014Z ) 2025-05-07T20:33:41.4923232Z self = 2025-05-07T20:33:41.4923405Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:41.4923409Z 2025-05-07T20:33:41.4923529Z @given( 2025-05-07T20:33:41.4923657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4923756Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4923916Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4924031Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4924145Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4927325Z ) 2025-05-07T20:33:41.4927587Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4927681Z def test_silu_mul_quant( 2025-05-07T20:33:41.4927763Z self, 2025-05-07T20:33:41.4927840Z T: int, 2025-05-07T20:33:41.4927916Z D: int, 2025-05-07T20:33:41.4928019Z scale_ub: Optional[float], 2025-05-07T20:33:41.4928108Z contiguous: bool, 2025-05-07T20:33:41.4928193Z compiled: bool, 2025-05-07T20:33:41.4928277Z ) -> None: 2025-05-07T20:33:41.4928375Z torch.manual_seed(2025) 2025-05-07T20:33:41.4928450Z 2025-05-07T20:33:41.4928686Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4930482Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4930495Z 2025-05-07T20:33:41.4930611Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4930616Z 2025-05-07T20:33:41.4930718Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4930942Z self=, 2025-05-07T20:33:41.4931022Z T=4096, 2025-05-07T20:33:41.4931103Z D=5120, 2025-05-07T20:33:41.4931191Z scale_ub=None, 2025-05-07T20:33:41.4931275Z contiguous=True, 2025-05-07T20:33:41.4931360Z compiled=False, 2025-05-07T20:33:41.4931438Z ) 2025-05-07T20:33:41.4931656Z self = 2025-05-07T20:33:41.4931825Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:41.4931833Z 2025-05-07T20:33:41.4931910Z @given( 2025-05-07T20:33:41.4932027Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4932128Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4932242Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4932357Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4932471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4932546Z ) 2025-05-07T20:33:41.4932792Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4932936Z def test_silu_mul_quant( 2025-05-07T20:33:41.4933012Z self, 2025-05-07T20:33:41.4933093Z T: int, 2025-05-07T20:33:41.4933169Z D: int, 2025-05-07T20:33:41.4933266Z scale_ub: Optional[float], 2025-05-07T20:33:41.4933362Z contiguous: bool, 2025-05-07T20:33:41.4933447Z compiled: bool, 2025-05-07T20:33:41.4933526Z ) -> None: 2025-05-07T20:33:41.4933624Z torch.manual_seed(2025) 2025-05-07T20:33:41.4933696Z 2025-05-07T20:33:41.4933863Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4935687Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4935730Z 2025-05-07T20:33:41.4935846Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4935851Z 2025-05-07T20:33:41.4935954Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4936175Z self=, 2025-05-07T20:33:41.4936257Z T=2048, 2025-05-07T20:33:41.4936335Z D=5120, 2025-05-07T20:33:41.4936418Z scale_ub=None, 2025-05-07T20:33:41.4936507Z contiguous=False, 2025-05-07T20:33:41.4936597Z compiled=False, 2025-05-07T20:33:41.4936669Z ) 2025-05-07T20:33:41.4936885Z self = 2025-05-07T20:33:41.4937060Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:41.4937066Z 2025-05-07T20:33:41.4937186Z @given( 2025-05-07T20:33:41.4937306Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4937407Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4937521Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4937638Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4937751Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4937831Z ) 2025-05-07T20:33:41.4938072Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4938167Z def test_silu_mul_quant( 2025-05-07T20:33:41.4938246Z self, 2025-05-07T20:33:41.4938322Z T: int, 2025-05-07T20:33:41.4938399Z D: int, 2025-05-07T20:33:41.4938500Z scale_ub: Optional[float], 2025-05-07T20:33:41.4938587Z contiguous: bool, 2025-05-07T20:33:41.4938671Z compiled: bool, 2025-05-07T20:33:41.4938757Z ) -> None: 2025-05-07T20:33:41.4938855Z torch.manual_seed(2025) 2025-05-07T20:33:41.4938932Z 2025-05-07T20:33:41.4939104Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4940890Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4940898Z 2025-05-07T20:33:41.4941014Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4941018Z 2025-05-07T20:33:41.4941121Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4941348Z self=, 2025-05-07T20:33:41.4941496Z T=4096, 2025-05-07T20:33:41.4941573Z D=7168, 2025-05-07T20:33:41.4941662Z scale_ub=None, 2025-05-07T20:33:41.4941746Z contiguous=True, 2025-05-07T20:33:41.4941829Z compiled=True, 2025-05-07T20:33:41.4941907Z ) 2025-05-07T20:33:41.4942122Z self = 2025-05-07T20:33:41.4942291Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:41.4942299Z 2025-05-07T20:33:41.4942376Z @given( 2025-05-07T20:33:41.4942494Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4942595Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4942711Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4942826Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4942986Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4943067Z ) 2025-05-07T20:33:41.4943311Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4943407Z def test_silu_mul_quant( 2025-05-07T20:33:41.4943531Z self, 2025-05-07T20:33:41.4943611Z T: int, 2025-05-07T20:33:41.4943687Z D: int, 2025-05-07T20:33:41.4943784Z scale_ub: Optional[float], 2025-05-07T20:33:41.4943876Z contiguous: bool, 2025-05-07T20:33:41.4943961Z compiled: bool, 2025-05-07T20:33:41.4944038Z ) -> None: 2025-05-07T20:33:41.4944135Z torch.manual_seed(2025) 2025-05-07T20:33:41.4944209Z 2025-05-07T20:33:41.4944374Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4946206Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4946216Z 2025-05-07T20:33:41.4946332Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4946336Z 2025-05-07T20:33:41.4946443Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4946662Z self=, 2025-05-07T20:33:41.4946742Z T=2048, 2025-05-07T20:33:41.4946819Z D=5120, 2025-05-07T20:33:41.4946904Z scale_ub=1200.0, 2025-05-07T20:33:41.4946992Z contiguous=False, 2025-05-07T20:33:41.4947074Z compiled=False, 2025-05-07T20:33:41.4947146Z ) 2025-05-07T20:33:41.4947364Z self = 2025-05-07T20:33:41.4947540Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:41.4947546Z 2025-05-07T20:33:41.4947622Z @given( 2025-05-07T20:33:41.4947748Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4947846Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4947958Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4948076Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4948188Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4948270Z ) 2025-05-07T20:33:41.4948509Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4948603Z def test_silu_mul_quant( 2025-05-07T20:33:41.4948682Z self, 2025-05-07T20:33:41.4948757Z T: int, 2025-05-07T20:33:41.4948833Z D: int, 2025-05-07T20:33:41.4948934Z scale_ub: Optional[float], 2025-05-07T20:33:41.4949025Z contiguous: bool, 2025-05-07T20:33:41.4949108Z compiled: bool, 2025-05-07T20:33:41.4949309Z ) -> None: 2025-05-07T20:33:41.4949407Z torch.manual_seed(2025) 2025-05-07T20:33:41.4949480Z 2025-05-07T20:33:41.4949649Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4951427Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4951435Z 2025-05-07T20:33:41.4951549Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4951593Z 2025-05-07T20:33:41.4951697Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4951922Z self=, 2025-05-07T20:33:41.4952037Z T=4096, 2025-05-07T20:33:41.4952112Z D=7168, 2025-05-07T20:33:41.4952200Z scale_ub=1200.0, 2025-05-07T20:33:41.4952283Z contiguous=True, 2025-05-07T20:33:41.4952364Z compiled=False, 2025-05-07T20:33:41.4952444Z ) 2025-05-07T20:33:41.4952656Z self = 2025-05-07T20:33:41.4952826Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:41.4952835Z 2025-05-07T20:33:41.4952912Z @given( 2025-05-07T20:33:41.4953026Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4953125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4953236Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4953353Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4953509Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4953583Z ) 2025-05-07T20:33:41.4953824Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4953921Z def test_silu_mul_quant( 2025-05-07T20:33:41.4953999Z self, 2025-05-07T20:33:41.4954078Z T: int, 2025-05-07T20:33:41.4954153Z D: int, 2025-05-07T20:33:41.4954248Z scale_ub: Optional[float], 2025-05-07T20:33:41.4954339Z contiguous: bool, 2025-05-07T20:33:41.4954423Z compiled: bool, 2025-05-07T20:33:41.4954500Z ) -> None: 2025-05-07T20:33:41.4954599Z torch.manual_seed(2025) 2025-05-07T20:33:41.4954670Z 2025-05-07T20:33:41.4954836Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4956689Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4956700Z 2025-05-07T20:33:41.4956817Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4956822Z 2025-05-07T20:33:41.4956925Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4957143Z self=, 2025-05-07T20:33:41.4957223Z T=16384, 2025-05-07T20:33:41.4957297Z D=7168, 2025-05-07T20:33:41.4957378Z scale_ub=None, 2025-05-07T20:33:41.4957464Z contiguous=False, 2025-05-07T20:33:41.4957545Z compiled=True, 2025-05-07T20:33:41.4957617Z ) 2025-05-07T20:33:41.4957833Z self = 2025-05-07T20:33:41.4958056Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:41.4958063Z 2025-05-07T20:33:41.4958144Z @given( 2025-05-07T20:33:41.4958267Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4958363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4958475Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4958593Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4958707Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4958781Z ) 2025-05-07T20:33:41.4959020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4959112Z def test_silu_mul_quant( 2025-05-07T20:33:41.4959193Z self, 2025-05-07T20:33:41.4959268Z T: int, 2025-05-07T20:33:41.4959344Z D: int, 2025-05-07T20:33:41.4959486Z scale_ub: Optional[float], 2025-05-07T20:33:41.4959578Z contiguous: bool, 2025-05-07T20:33:41.4959666Z compiled: bool, 2025-05-07T20:33:41.4959748Z ) -> None: 2025-05-07T20:33:41.4959882Z torch.manual_seed(2025) 2025-05-07T20:33:41.4959953Z 2025-05-07T20:33:41.4960123Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4961904Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4961913Z 2025-05-07T20:33:41.4962030Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4962075Z 2025-05-07T20:33:41.4962176Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4962398Z self=, 2025-05-07T20:33:41.4962479Z T=4096, 2025-05-07T20:33:41.4962555Z D=7168, 2025-05-07T20:33:41.4962642Z scale_ub=None, 2025-05-07T20:33:41.4962724Z contiguous=True, 2025-05-07T20:33:41.4962806Z compiled=False, 2025-05-07T20:33:41.4962881Z ) 2025-05-07T20:33:41.4963094Z self = 2025-05-07T20:33:41.4963263Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:41.4963267Z 2025-05-07T20:33:41.4963344Z @given( 2025-05-07T20:33:41.4963460Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4963564Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4963680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4963799Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4963914Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4963988Z ) 2025-05-07T20:33:41.4964227Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4964322Z def test_silu_mul_quant( 2025-05-07T20:33:41.4964397Z self, 2025-05-07T20:33:41.4964477Z T: int, 2025-05-07T20:33:41.4964552Z D: int, 2025-05-07T20:33:41.4964647Z scale_ub: Optional[float], 2025-05-07T20:33:41.4964737Z contiguous: bool, 2025-05-07T20:33:41.4964822Z compiled: bool, 2025-05-07T20:33:41.4964898Z ) -> None: 2025-05-07T20:33:41.4964996Z torch.manual_seed(2025) 2025-05-07T20:33:41.4965068Z 2025-05-07T20:33:41.4965232Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4967390Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4967491Z 2025-05-07T20:33:41.4967611Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4967616Z 2025-05-07T20:33:41.4967718Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4967938Z self=, 2025-05-07T20:33:41.4968018Z T=16384, 2025-05-07T20:33:41.4968094Z D=7168, 2025-05-07T20:33:41.4968173Z scale_ub=None, 2025-05-07T20:33:41.4968262Z contiguous=True, 2025-05-07T20:33:41.4968406Z compiled=False, 2025-05-07T20:33:41.4968482Z ) 2025-05-07T20:33:41.4968701Z self = 2025-05-07T20:33:41.4968873Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:41.4968937Z 2025-05-07T20:33:41.4969014Z @given( 2025-05-07T20:33:41.4969132Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4969228Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4969340Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4969459Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4969569Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4969645Z ) 2025-05-07T20:33:41.4969884Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4969976Z def test_silu_mul_quant( 2025-05-07T20:33:41.4970055Z self, 2025-05-07T20:33:41.4970135Z T: int, 2025-05-07T20:33:41.4970213Z D: int, 2025-05-07T20:33:41.4970396Z scale_ub: Optional[float], 2025-05-07T20:33:41.4970488Z contiguous: bool, 2025-05-07T20:33:41.4970572Z compiled: bool, 2025-05-07T20:33:41.4970655Z ) -> None: 2025-05-07T20:33:41.4970749Z torch.manual_seed(2025) 2025-05-07T20:33:41.4970820Z 2025-05-07T20:33:41.4970988Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4972771Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
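[NOTE] The allocation sizes reported by the allocator line up exactly with the test's input shape: x is T rows by 2*D columns in bfloat16 (2 bytes per element), so T=16384, D=7168 gives the 448.00 MiB request above. A quick arithmetic check:

    # sketch: confirm the 448 MiB allocation reported for T=16384, D=7168
    T, D = 16384, 7168
    n_bytes = T * (2 * D) * 2      # bf16 is 2 bytes per element
    print(n_bytes / 1024 ** 2)     # -> 448.0 (MiB), matching the log

The 40, 56, 80, 112, and 320 MiB requests elsewhere in this output correspond to the other T/D combinations by the same formula.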
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4972781Z 2025-05-07T20:33:41.4972898Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4972903Z 2025-05-07T20:33:41.4973004Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4973224Z self=, 2025-05-07T20:33:41.4973299Z T=16384, 2025-05-07T20:33:41.4973375Z D=7168, 2025-05-07T20:33:41.4973460Z scale_ub=1200.0, 2025-05-07T20:33:41.4973544Z contiguous=True, 2025-05-07T20:33:41.4973627Z compiled=False, 2025-05-07T20:33:41.4973707Z ) 2025-05-07T20:33:41.4973923Z self = 2025-05-07T20:33:41.4974100Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:41.4974104Z 2025-05-07T20:33:41.4974182Z @given( 2025-05-07T20:33:41.4974299Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4974404Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4974565Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4974684Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4974800Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4974874Z ) 2025-05-07T20:33:41.4975112Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4975209Z def test_silu_mul_quant( 2025-05-07T20:33:41.4975284Z self, 2025-05-07T20:33:41.4975361Z T: int, 2025-05-07T20:33:41.4975435Z D: int, 2025-05-07T20:33:41.4975532Z scale_ub: Optional[float], 2025-05-07T20:33:41.4975624Z contiguous: bool, 2025-05-07T20:33:41.4975709Z compiled: bool, 2025-05-07T20:33:41.4975786Z ) -> None: 2025-05-07T20:33:41.4975884Z torch.manual_seed(2025) 2025-05-07T20:33:41.4975957Z 2025-05-07T20:33:41.4976121Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4977953Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4978000Z 2025-05-07T20:33:41.4978115Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4978119Z 2025-05-07T20:33:41.4978224Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4978442Z self=, 2025-05-07T20:33:41.4978527Z T=128, 2025-05-07T20:33:41.4978604Z D=5120, 2025-05-07T20:33:41.4978692Z scale_ub=1200.0, 2025-05-07T20:33:41.4978782Z contiguous=False, 2025-05-07T20:33:41.4978904Z compiled=False, 2025-05-07T20:33:41.4978980Z ) 2025-05-07T20:33:41.4979198Z self = 2025-05-07T20:33:41.4979371Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:41.4979376Z 2025-05-07T20:33:41.4979451Z @given( 2025-05-07T20:33:41.4979571Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4979667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4979781Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4979897Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4980009Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4980085Z ) 2025-05-07T20:33:41.4980324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4980419Z def test_silu_mul_quant( 2025-05-07T20:33:41.4980498Z self, 2025-05-07T20:33:41.4980577Z T: int, 2025-05-07T20:33:41.4980656Z D: int, 2025-05-07T20:33:41.4980757Z scale_ub: Optional[float], 2025-05-07T20:33:41.4980848Z contiguous: bool, 2025-05-07T20:33:41.4980933Z compiled: bool, 2025-05-07T20:33:41.4981013Z ) -> None: 2025-05-07T20:33:41.4981106Z torch.manual_seed(2025) 2025-05-07T20:33:41.4981177Z 2025-05-07T20:33:41.4981345Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4981418Z 2025-05-07T20:33:41.4981516Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4981640Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4981730Z x = x_sign * x_clamp 2025-05-07T20:33:41.4981815Z x0 = x[:, :D] 2025-05-07T20:33:41.4981896Z x1 = x[:, D:] 2025-05-07T20:33:41.4981967Z 2025-05-07T20:33:41.4982054Z if contiguous: 2025-05-07T20:33:41.4982148Z x0 = x0.contiguous() 2025-05-07T20:33:41.4982238Z x1 = x1.contiguous() 2025-05-07T20:33:41.4982364Z 2025-05-07T20:33:41.4982455Z if scale_ub is not None: 2025-05-07T20:33:41.4982559Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4982700Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4982776Z ) 2025-05-07T20:33:41.4982857Z else: 2025-05-07T20:33:41.4982951Z scale_ub_tensor = None 2025-05-07T20:33:41.4983024Z 2025-05-07T20:33:41.4983155Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4983248Z op = silu_mul_quant 2025-05-07T20:33:41.4983331Z if compiled: 2025-05-07T20:33:41.4983433Z op = torch.compile(op) 2025-05-07T20:33:41.4983537Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4983608Z 2025-05-07T20:33:41.4983703Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4983708Z 2025-05-07T20:33:41.4983847Z moe/activation_test.py:117: 2025-05-07T20:33:41.4983984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4984084Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4984220Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4984723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4984823Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4985182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4985407Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4985745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4985844Z kernel = self.compile( 2025-05-07T20:33:41.4986226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4986440Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4986572Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4986579Z 2025-05-07T20:33:41.4986784Z self = 2025-05-07T20:33:41.4987564Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4988068Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa2b4860>} 2025-05-07T20:33:41.4988818Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4989017Z context = 2025-05-07T20:33:41.4989024Z 2025-05-07T20:33:41.4989186Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4989451Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4989560Z module_map=module_map) 2025-05-07T20:33:41.4989720Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4989821Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4989897Z E ^ 2025-05-07T20:33:41.4990258Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4990263Z 2025-05-07T20:33:41.4990676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4990723Z 2025-05-07T20:33:41.4990827Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4991055Z self=, 2025-05-07T20:33:41.4991133Z T=2048, 2025-05-07T20:33:41.4991209Z D=7168, 2025-05-07T20:33:41.4991293Z scale_ub=None, 2025-05-07T20:33:41.4991380Z contiguous=False, 2025-05-07T20:33:41.4991466Z compiled=False, 2025-05-07T20:33:41.4991538Z ) 2025-05-07T20:33:41.4991753Z self = 2025-05-07T20:33:41.4991931Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:41.4991935Z 2025-05-07T20:33:41.4992012Z @given( 2025-05-07T20:33:41.4992129Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4992231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4992345Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4992503Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4992624Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4992698Z ) 2025-05-07T20:33:41.4992984Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4993076Z def test_silu_mul_quant( 2025-05-07T20:33:41.4993153Z self, 2025-05-07T20:33:41.4993233Z T: int, 2025-05-07T20:33:41.4993308Z D: int, 2025-05-07T20:33:41.4993406Z scale_ub: Optional[float], 2025-05-07T20:33:41.4993497Z contiguous: bool, 2025-05-07T20:33:41.4993580Z compiled: bool, 2025-05-07T20:33:41.4993657Z ) -> None: 2025-05-07T20:33:41.4993753Z torch.manual_seed(2025) 2025-05-07T20:33:41.4993825Z 2025-05-07T20:33:41.4993994Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4995887Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
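[NOTE] The fp8e4nv CompilationError does not depend on Hypothesis, the example parameters, or torch.compile: every path ends at the same _fbgemm_silu_mul_quant kernel launch inside silu_mul_quant. A minimal standalone repro sketch, assuming the import path shown in the traceback and the (x0, x1, scale_ub) calling convention seen in the test body:

    # sketch: reproduce the Triton CompilationError without pytest/Hypothesis
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    D = 5120
    x = torch.randn(16, 2 * D, device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

    # expected to raise CompilationError ("type fp8e4nv not supported ...") on pre-SM-8.9 GPUs
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)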
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4995899Z 2025-05-07T20:33:41.4996020Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4996024Z 2025-05-07T20:33:41.4996128Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4996350Z self=, 2025-05-07T20:33:41.4996435Z T=128, 2025-05-07T20:33:41.4996512Z D=7168, 2025-05-07T20:33:41.4996599Z scale_ub=1200.0, 2025-05-07T20:33:41.4996687Z contiguous=True, 2025-05-07T20:33:41.4996771Z compiled=True, 2025-05-07T20:33:41.4996846Z ) 2025-05-07T20:33:41.4997069Z self = 2025-05-07T20:33:41.4997237Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:41.4997243Z 2025-05-07T20:33:41.4997327Z @given( 2025-05-07T20:33:41.4997444Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4997546Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4997665Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4997782Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4997894Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4997972Z ) 2025-05-07T20:33:41.4998214Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4998308Z def test_silu_mul_quant( 2025-05-07T20:33:41.4998392Z self, 2025-05-07T20:33:41.4998470Z T: int, 2025-05-07T20:33:41.4998553Z D: int, 2025-05-07T20:33:41.4998657Z scale_ub: Optional[float], 2025-05-07T20:33:41.4998796Z contiguous: bool, 2025-05-07T20:33:41.4998886Z compiled: bool, 2025-05-07T20:33:41.4998970Z ) -> None: 2025-05-07T20:33:41.4999067Z torch.manual_seed(2025) 2025-05-07T20:33:41.4999144Z 2025-05-07T20:33:41.4999309Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4999385Z 2025-05-07T20:33:41.4999479Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4999604Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4999693Z x = x_sign * x_clamp 2025-05-07T20:33:41.4999778Z x0 = x[:, :D] 2025-05-07T20:33:41.4999859Z x1 = x[:, D:] 2025-05-07T20:33:41.4999932Z 2025-05-07T20:33:41.5000024Z if contiguous: 2025-05-07T20:33:41.5000115Z x0 = x0.contiguous() 2025-05-07T20:33:41.5000207Z x1 = x1.contiguous() 2025-05-07T20:33:41.5000281Z 2025-05-07T20:33:41.5000419Z if scale_ub is not None: 2025-05-07T20:33:41.5000532Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.5000666Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.5000810Z ) 2025-05-07T20:33:41.5000891Z else: 2025-05-07T20:33:41.5000985Z scale_ub_tensor = None 2025-05-07T20:33:41.5001058Z 2025-05-07T20:33:41.5001192Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.5001283Z op = silu_mul_quant 2025-05-07T20:33:41.5001370Z if compiled: 2025-05-07T20:33:41.5001470Z op = torch.compile(op) 2025-05-07T20:33:41.5001574Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.5001651Z 2025-05-07T20:33:41.5001742Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.5001747Z 2025-05-07T20:33:41.5001844Z moe/activation_test.py:117: 2025-05-07T20:33:41.5001975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.5002077Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.5002219Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.5002590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.5002687Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.5003175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.5003280Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.5003634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.5003858Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.5004193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.5004287Z kernel = self.compile( 2025-05-07T20:33:41.5004674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.5004849Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.5004985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.5004990Z 2025-05-07T20:33:41.5005194Z self = 2025-05-07T20:33:41.5005967Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.5006472Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa2b59e0>} 2025-05-07T20:33:41.5007220Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.5007465Z context = 2025-05-07T20:33:41.5007473Z 2025-05-07T20:33:41.5007637Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.5007900Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.5008010Z module_map=module_map) 2025-05-07T20:33:41.5008172Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.5008275Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.5008353Z E ^ 2025-05-07T20:33:41.5008707Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.5008712Z 2025-05-07T20:33:41.5009169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.5009176Z 2025-05-07T20:33:41.5009279Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.5009543Z self=, 2025-05-07T20:33:41.5009624Z T=128, 2025-05-07T20:33:41.5009702Z D=7168, 2025-05-07T20:33:41.5009792Z scale_ub=1200.0, 2025-05-07T20:33:41.5009877Z contiguous=True, 2025-05-07T20:33:41.5009961Z compiled=False, 2025-05-07T20:33:41.5010039Z ) 2025-05-07T20:33:41.5010253Z self = 2025-05-07T20:33:41.5010424Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:41.5010428Z 2025-05-07T20:33:41.5010510Z @given( 2025-05-07T20:33:41.5010628Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.5010729Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.5010852Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.5011012Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.5011130Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.5011207Z ) 2025-05-07T20:33:41.5011447Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.5011544Z def test_silu_mul_quant( 2025-05-07T20:33:41.5011622Z self, 2025-05-07T20:33:41.5011698Z T: int, 2025-05-07T20:33:41.5011775Z D: int, 2025-05-07T20:33:41.5011872Z scale_ub: Optional[float], 2025-05-07T20:33:41.5011959Z contiguous: bool, 2025-05-07T20:33:41.5012048Z compiled: bool, 2025-05-07T20:33:41.5012125Z ) -> None: 2025-05-07T20:33:41.5012220Z torch.manual_seed(2025) 2025-05-07T20:33:41.5012292Z 2025-05-07T20:33:41.5012459Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.5012534Z 2025-05-07T20:33:41.5012628Z x_sign = torch.sign(x) 2025-05-07T20:33:41.5012755Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.5014542Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.5014550Z 2025-05-07T20:33:41.5014667Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:41.5014672Z 2025-05-07T20:33:41.5014777Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.5015000Z self=, 2025-05-07T20:33:41.5015079Z T=128, 2025-05-07T20:33:41.5015205Z D=5120, 2025-05-07T20:33:41.5015289Z scale_ub=1200.0, 2025-05-07T20:33:41.5015379Z contiguous=True, 2025-05-07T20:33:41.5015465Z compiled=True, 2025-05-07T20:33:41.5015539Z ) 2025-05-07T20:33:41.5015757Z self = 2025-05-07T20:33:41.5015923Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:41.5015928Z 2025-05-07T20:33:41.5016006Z @given( 2025-05-07T20:33:41.5016128Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.5016228Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.5016343Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.5016463Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.5016577Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.5016659Z ) 2025-05-07T20:33:41.5016942Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.5017045Z def test_silu_mul_quant( 2025-05-07T20:33:41.5017126Z self, 2025-05-07T20:33:41.5017203Z T: int, 2025-05-07T20:33:41.5017324Z D: int, 2025-05-07T20:33:41.5017425Z scale_ub: Optional[float], 2025-05-07T20:33:41.5017514Z contiguous: bool, 2025-05-07T20:33:41.5017599Z compiled: bool, 2025-05-07T20:33:41.5017681Z ) -> None: 2025-05-07T20:33:41.5017777Z torch.manual_seed(2025) 2025-05-07T20:33:41.5017850Z 2025-05-07T20:33:41.5018020Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.5018093Z 2025-05-07T20:33:41.5018188Z x_sign = torch.sign(x) 2025-05-07T20:33:41.5018311Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.5020127Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
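[NOTE] The free-memory figures shrink monotonically across examples (26.44 MiB free in the earlier failures, 4.44 MiB free here) while PyTorch's allocated total creeps from 21.69 GiB up to 21.77 GiB, so memory from earlier Hypothesis examples is evidently still held when later ones run. With @given on a unittest.TestCase method, setUp/tearDown run once per test rather than once per example, so any cleanup has to live in the test body itself. A sketch of such per-example cleanup; the helper and its placement are assumptions, not code from activation_test.py:

    # sketch: release cached CUDA memory at the top of each Hypothesis example
    import gc
    import torch

    def _free_cuda() -> None:
        gc.collect()               # drop dead Python references first
        torch.cuda.empty_cache()   # then return cached blocks to the driver

    # inside test_silu_mul_quant, before the first allocation:
    #     _free_cuda()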
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.5020142Z 2025-05-07T20:33:41.5020260Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:41.5020265Z 2025-05-07T20:33:41.5020366Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.5020591Z self=, 2025-05-07T20:33:41.5020669Z T=128, 2025-05-07T20:33:41.5020748Z D=7168, 2025-05-07T20:33:41.5020832Z scale_ub=None, 2025-05-07T20:33:41.5020918Z contiguous=True, 2025-05-07T20:33:41.5021001Z compiled=True, 2025-05-07T20:33:41.5021077Z ) 2025-05-07T20:33:41.5021295Z self = 2025-05-07T20:33:41.5021469Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:41.5021474Z 2025-05-07T20:33:41.5021554Z @given( 2025-05-07T20:33:41.5021673Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.5021773Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.5021886Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.5022001Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.5022116Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.5022190Z ) 2025-05-07T20:33:41.5022431Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.5022527Z def test_silu_mul_quant( 2025-05-07T20:33:41.5022604Z self, 2025-05-07T20:33:41.5022683Z T: int, 2025-05-07T20:33:41.5022762Z D: int, 2025-05-07T20:33:41.5022861Z scale_ub: Optional[float], 2025-05-07T20:33:41.5022952Z contiguous: bool, 2025-05-07T20:33:41.5023087Z compiled: bool, 2025-05-07T20:33:41.5023165Z ) -> None: 2025-05-07T20:33:41.5023262Z torch.manual_seed(2025) 2025-05-07T20:33:41.5023339Z 2025-05-07T20:33:41.5023504Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.5025283Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.5025289Z 2025-05-07T20:33:41.5025445Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.5025587Z =============================== warnings summary =============================== 2025-05-07T20:33:41.5025891Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:41.5026234Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:41.5026529Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:41.5027402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:41.5027636Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:41.5027640Z 2025-05-07T20:33:41.5027853Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:41.5028063Z ================= 1 failed, 1 deselected, 3 warnings in 13.15s ================= 2025-05-07T20:33:43.0829866Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:43.1444164Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:43.1444423Z 2025-05-07T20:33:43.1444603Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:43.1445177Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:43.1445590Z 2025-05-07T20:33:43.1445594Z 2025-05-07T20:33:43.1445598Z 2025-05-07T20:33:43.1462426Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:43.1552191Z Post job cleanup. 2025-05-07T20:33:43.2538980Z [command]/usr/bin/git version 2025-05-07T20:33:43.2581385Z git version 2.47.1 2025-05-07T20:33:43.2616152Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/db22bca0-6ceb-4a34-9559-aee67b9a86bd/.gitconfig' 2025-05-07T20:33:43.2626608Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/db22bca0-6ceb-4a34-9559-aee67b9a86bd' before making global git config changes 2025-05-07T20:33:43.2627464Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:43.2640915Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:43.2682647Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:43.2717213Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:43.3054766Z Entering 'external/asmjit' 2025-05-07T20:33:43.3121468Z Entering 'external/composable_kernel' 2025-05-07T20:33:43.3195030Z Entering 'external/cpuinfo' 2025-05-07T20:33:43.3259952Z Entering 'external/cutlass' 2025-05-07T20:33:43.3336969Z Entering 'external/googletest' 2025-05-07T20:33:43.3404196Z Entering 'external/hipify_torch' 2025-05-07T20:33:43.3469523Z Entering 'external/json' 2025-05-07T20:33:43.3559248Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:43.3582467Z http.https://github.com/.extraheader 2025-05-07T20:33:43.3593728Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:43.3624637Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:43.3953627Z Entering 'external/asmjit' 2025-05-07T20:33:43.3996634Z http.https://github.com/.extraheader 2025-05-07T20:33:43.4039330Z Entering 'external/composable_kernel' 2025-05-07T20:33:43.4084203Z http.https://github.com/.extraheader 2025-05-07T20:33:43.4132537Z Entering 'external/cpuinfo' 2025-05-07T20:33:43.4175374Z http.https://github.com/.extraheader 2025-05-07T20:33:43.4218404Z Entering 'external/cutlass' 2025-05-07T20:33:43.4261693Z http.https://github.com/.extraheader 2025-05-07T20:33:43.4314964Z 
Entering 'external/googletest' 2025-05-07T20:33:43.4357879Z http.https://github.com/.extraheader 2025-05-07T20:33:43.4401462Z Entering 'external/hipify_torch' 2025-05-07T20:33:43.4444699Z http.https://github.com/.extraheader 2025-05-07T20:33:43.4488487Z Entering 'external/json' 2025-05-07T20:33:43.4531917Z http.https://github.com/.extraheader 2025-05-07T20:33:43.4685862Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:33:43.4719486Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:33:43.4729863Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:33:43.4730235Z ##[endgroup] 2025-05-07T20:33:43.4827292Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:33:54.2269793Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:34:10.6241936Z Cleaning up orphan processes